Arrays and Lists in SQL Server
Data from Performance Tests in SQL 2005

An SQL text by Erland Sommarskog, SQL Server MVP Latest revision 2010-10-10. Copyright applies to this text.

This is an appendix to the article Arrays and Lists in SQL Server 2005 and Beyond. In this appendix, I present data and analyses from the performance tests I ran in 2006 for the various list-to-table methods. If you have arrived at this page from a web search, you may want to at least browse the main article first. There is a second appendix with tests that I ran in 2009 on SQL 2008, which covers many of the methods tested here, as well as new methods that I added to the article after completing the original test suite, including table-valued parameters, a new feature in SQL 2008. The two appendixes are differently structured, and you need to read both to get a complete picture. There are observations that I make only in this appendix and vice versa.

Table of Contents:

   General Disclaimers
   The Contenders
   How the Tests Were Carried Out
      The Test Table
      The Test Data
      Tested Operations
      Sample Test Procedures
      The Test Script
      Used Hardware
   The Results
      List with 20 Elements
      List with 200 Elements
      List with 650 Elements
      List with 2000 Elements
      List with 10000 Elements
      Conclusion
   Special Observations
      To Chunk or Not to Chunk
      Inline, Multi-statement and Temp Tables
      The Collation Issue
      Fixed-Length and the String-Length Threshold
      FIX$BINARY
      CLR
      XML
      Both varchar and nvarchar in the Return Table
      Variations of inline_split_me
      Inserting Many Values
      Dynamic SQL
   All Methods in the Test
   Revisions

General Disclaimers

I decided to keep these tests fairly simple.

While these tests give us some information, there are also pieces missing. The numbers should be informative enough to tell us how fast or slow the methods as such are. But they do not tell us how they will work with your tables, nor do they say much about how they will work under load.

Thus, if you need top performance, you may need to conduct your own tests with your tables. Maybe you need to include the client-side time it takes to compose XML documents and the extra network cost for XML or fixed-length if you are considering these methods. Maybe you need to conduct testing with many parallel processes, so that you see what happens when SQL Server comes under load. And real-world application tables are not likely to fit into cache.

I would like to stress that you cannot use the results in this article and the old article for SQL 2000 to compare SQL 2005 to SQL 2000. There are important differences in the test setup that make any such comparison impossible in general. The only exception is dynamic SQL, where the difference is drastic for large input.

Note: for the SQL 2000 tests, I made the entire test kit available in a zip file, but I have opted not to do so this time. The test database is much larger, and a zip file of it all is almost 50 MB. All stored procedures and the test script are available through links in this appendix. Please contact me if you also want the database.

The Contenders

As I have assembled material for this article, I have tested far too many methods to include data for all of them in the main body of this text. All methods that I have tested are derived from the methods I discuss in the main article. Some methods are important modifications to the main theme, whereas others are just small variations, like the same algorithm in both an inline function and a multi-statement function. For many of these small variations I first ran a quick test, and if that quick test did not reveal anything significant, I did not save that data. Eventually, 35 "methods" made it to the finals, run on three different machines.

From this, I have selected twelve methods for which I will present the results in detail. In the section Special Observations, I highlight various things of interest, and most of the other methods appear there. The full data for all 35 methods on all three test machines are available in text files. There are links to these files further down in this text.

I have given each method that I've tested a name, and I will use these names as a convenient short-hand notation, mixed with the names of the functions as they appeared in the main article. Throughout the appendix, these names appear as links, and if you click a link, you will be brought to a page with the test procedures, which in their turn have links to the tested functions. Thus, in this way you can see exactly what I have tested.

Here are the twelve main competitors:

ITER$CHUNK The iterative functions iter_intlist_to_tbl and iter_charlist_to_tbl.
CLR$ITER The CLR functions CLR_intlist_iter and CLR_charlist_iter.
XMLATTR Using nodes and value with attribute-centred XML.
TBLNUM$IL A table with numbers: inline_split_me.
TBLNUM A table of numbers: the function text_split_me which is the same as duo_text_split_me in the main article, save that it only returns an nvarchar column.
FIX$SINGLE Fixed-length list elements: the function fixstring_single, joining once with the table of numbers.
FIX$MULTI Fixed-length list elements: fixstring_multi, that joins the table of numbers twice to make it virtually unlimited.
FIX$MSTMT Fixed-length list elements: a multi-statement version of fixstring_single.
CTE$IL Recursive CTE, the function cte_split_inline.
EXEC$A Dynamic SQL.
MANYSELECT Making the list into a SELECT, the procedures unpack_with_manyselect and unpackstr_with_manyselect.
REALSLOW A really slow method, using charindex to find the list elements.

The full list of all tested methods is at the end of this article.

How the Tests Were Carried Out

The Test Table

This is the table I have used for all tests.

CREATE TABLE usrdictwords
  (wordno int          NOT NULL,
   word   nvarchar(50) NOT NULL,
   guid   char(36)     NOT NULL)
CREATE UNIQUE CLUSTERED INDEX wordno_ix ON usrdictwords(wordno)
CREATE UNIQUE INDEX word_ix ON usrdictwords(word)

The table contains 1 163 750 rows, and the average length of the column word is 10.6 characters. The longest word is 30 characters. wordno holds numbers in the range from 0 to 1 163 749. The order of wordno is not correlated with the order of word. The column guid serves to make the table a little wider, and to be a token non-indexed column. The data in guid is not used for anything in the tests.

The collation for the database was Slovenian_CS_AS. (The data is taken from a Slovenian version of the Unix file /usr/dict/words, used for the spell utility, that I culled off the net somewhere.)

Note: the test table for SQL 2000 had the same schema, except that I used varchar in those tests. The volume in the SQL 2000 tests was a mere 200 000 words, even though the source that time too was a spelling dictionary for Slovenian. Somehow, Slovenian has expanded its vocabulary by almost a factor of six in just a few years!

The Test Data

For each test, my test script randomly selected rows from usrdictwords, to construct one comma-separated list of strings and one list of integers from the columns word and wordno respectively. After each list item, the script randomly added 0-3 spaces between the list elements. (The reason for this is that the test script also served to validate that the methods yielded the correct result.) For fixed-length and XML the list was transformed to the respective formats. For the really slow methods, the random spacing was removed, as they are not able to cope with it.
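
To make this concrete, here is a small illustration of what the different inputs might look like (a hedged sketch: the element values are made up, and the exact padding and the XML element and attribute names used in the tests are not spelled out in this appendix):

DECLARE @strlist nvarchar(MAX), @intlist nvarchar(MAX),
        @fixlist nvarchar(MAX), @xmllist nvarchar(MAX)

-- Comma-separated strings, with 0-3 random extra spaces after each element:
SELECT @strlist = N'beseda,  druga,   tretja,beseda4'
-- Integers, separated by a space plus 0-3 random extra spaces:
SELECT @intlist = N'17  4711 123456   9'
-- Fixed-length: every element padded to the same width (9 characters for
-- integers and 30 for strings in these tests), with no delimiters:
SELECT @fixlist = N'       17     4711   123456        9'
-- Attribute-centred XML, roughly:
SELECT @xmllist = N'<Root><Num num="17"/><Num num="4711"/></Root>'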

I ran the test with five different list sizes: 20, 200, 650, 2 000 and 10 000 elements in the list. The basic strategy was to run each method 100 times. However, to avoid that slow methods kept the script running for days, the script applied a cut-off strategy, so that when a method had used a total of 15 minutes, it was not permitted to run more on that list size. It was still given the chance to run two runs on the next list size, as long as it did not exceed the next limit of 75 minutes.

Tested Operations

I have tested two types of operations:

UNPACK Unpacking the input list itself into a result set, without involving another table. This test is not possible to carry out with dynamic SQL and the really slow methods.
JOIN Using the input list to extract data from the test table. With most methods, this means a join operation, and I denote this operation as JOIN also for a method like dynamic SQL even if there is no actual join taking place.

Sample Test Procedures

Here is a sample of a typical UNPACK procedure:

CREATE PROCEDURE TBLNUM_Int_UNPACK_test
                 @str    nvarchar(MAX),
                 @tookms int OUTPUT AS

DECLARE @start datetime
SELECT @start = getdate()

INSERT Int_UNPACK(number)
SELECT number = convert(int, str)
FROM   chunk_split_me(@str, DEFAULT)

SELECT @tookms = datediff(ms, @start, getdate());

SELECT number FROM Int_UNPACK
TRUNCATE TABLE Int_UNPACK

The procedure: 1) Accepts an nvarchar(MAX) parameter. 2) Starts a timer. 3) Issues the query, inserting the result set into a permanent, empty, table. 4) Stops the timer. 5) Returns the data to the test script, which checks that the procedure returned the expected data. 6) Truncates the table after the run.

I insert data into a table, since had I returned data to the client, the measured values would have been distorted by random factors not related to the tested methods. (You may wonder why I used INSERT and not SELECT INTO, which I did in the tests for SQL 2000. I started with SELECT INTO, but I found that I got random values that were several hundred ms higher than most other values for a certain test. I suspected that the reason was auto-stats on the system tables, which you cannot turn off. And, indeed, switching to INSERT led to the timings being more consistent.)

The test procedures for XML accept a parameter of the xml data type. If there is a larger overhead for receiving an xml parameter, this is not captured in the test. I discuss this problem in further detail in the section on XML under Special Observations.

I also give an example of a JOIN procedure to highlight one important detail with the string join:

CREATE PROCEDURE CLR$ITER_Str_JOIN_test
                 @str nvarchar(MAX),
                 @tookms int OUTPUT AS

DECLARE @start datetime
SELECT @start = getdate()

INSERT Str_JOIN(wordno, guid)
SELECT u.wordno, u.guid
FROM   usrdictwords u
JOIN   CLR_charlist_iter(@str, DEFAULT) AS a on u.word = a.str

SELECT @tookms = datediff(ms, @start, getdate());

SELECT wordno FROM Str_JOIN
TRUNCATE TABLE Str_JOIN

Note here that when joining over the string column in the test table, I am also retrieving the column guid, which is not in any index. This forces SQL Server to access the data pages of the table. (Otherwise the query would be covered by the index on word.) This gives the optimizer two choices: using the non-clustered index on word with bookmark lookups, or scanning the table.

For most methods, there have been four procedures: Str_UNPACK, Int_UNPACK, Str_JOIN and Int_JOIN. You can retrieve the test procedures for a certain method by clicking on the method name as it appears in the article.

The Test Script

The test script is written in Perl, connecting to SQL Server using the SQL Native Client OLE DB provider.

The script creates the database with a size of 195 MB for the data file and 100 MB for the log file (the initial size is 750 MB during the data load, but the script shrinks it before running the tests). The intention was that the test procedures should be run with simple recovery, but they were in fact run with bulk-logged recovery. tempdb is not controlled by the script, but I made sure that tempdb had at least 50 MB for both data and log.

The test script runs with the default SET options you get with OLE DB, except that it also issues SET NOCOUNT ON.

When running the tests, the test script creates lists of strings and integers as described above, and then passes the assembled lists to all test procedures currently in the test database in alphabetic order. That is, all methods see the same lists. The script first runs all tests for lists with 20 elements, then all for 200 elements etc. When starting with a new list length, the script performs an sp_recompile on the usrdictwords table to flush all plans. It then runs a test 0 for that length, which is not included in the results, so that the cost for compilation does not cause distortions.
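
In T-SQL terms, what the script does between list sizes amounts to this call, which marks all plans that reference the test table for recompilation:

EXEC sp_recompile 'usrdictwords'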

The test script uses a fixed seed for the random generator, so different runs of the script generate the same test data. This permitted me to test a method separately from the rest, yet using exactly the same data as for the others.

The script starts a transaction before running a test procedure and commits it after the run.

The script is able to call the test procedures through RPC or by sending EXEC commands, but in the final test runs I ran all tests with RPC calls. (The choice of calling method should reasonably not have any importance, but during the SQL 2000 tests, the fixed-length methods behaved very differently depending on the call method for some list sizes. In my preliminary tests I was not able to repeat this on SQL 2005.)

You can view the complete test script here.

Used Hardware

I ran the tests on three different machines:

JAMIE4K This is a server with 4 Pentium III, 550 MHz, 3 GB of RAM, disk arrays. Windows 2000 SP4. This server was also used for the SQL 2000 tests.
KESÄMETSÄ This is my workstation at home, with a HT Pentium 4 at 2.8 GHz, 1.5 GB of RAM. Windows XP SP2. IDE disks. (The same machine appeared in the SQL 2000 tests, but I have replaced most of the hardware since then.)
SOMMERSKOV Another machine at home with an AMD Athlon X2 4200, 2.2 GHz, dual core. 2 GB of RAM. Windows XP x64 SP1. SATA disks.

When running the tests for KESÄMETSÄ and SOMMERSKOV, I ran the test scripts locally. When running the tests for JAMIE4K, I ran the script from my workstation at work. (JAMIE4K is one of our development servers.)

In the main presentation of the results in this article, I only show the results from JAMIE4K for the shorter list lengths. I show data for KESÄMETSÄ and SOMMERSKOV for the tests with 2 000 and 10 000 elements. There are several reasons for this. One is that of the test machines, JAMIE4K is the only one that resembles a production server. (In fact, it has a history in the dim and distant past as a production box.) Its CPU specs may seem humble, but that is actually an advantage here. SQL Server has a resolution of 3.33 ms for datetime, and in my experience it's impossible when measuring with getdate to get a result greater than zero but lower than 13 ms. On SOMMERSKOV and KESÄMETSÄ, there are plenty of runs at 0 ms, even for 650 elements, whereas JAMIE4K only has a handful for 200 elements, and none at 650 elements.

I should also note that on KESÄMETSÄ, I ran the tests with hyper-threading enabled, and with SQL Server configured to have both logical CPUs available. Normally, you disable hyper-threading when you run SQL Server on hyper-threaded CPUs, or at least you set the configuration option max degree of parallelism to the number of physical CPUs (or CPU cores), since parallel plans do not benefit from hyper-threading. (As I understand it, hyper-threading is achieved by using free cycles in the CPU while waiting for data to come in from memory. But with parallel queries, data comes in all the time, and there are no free cycles, so the threads intended to run in parallel are instead serialised.)
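
For reference, this is roughly what such a configuration looks like (a sketch only, not something the test script does; here assuming a box with a single physical CPU):

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'max degree of parallelism', 1
RECONFIGURE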

The Results

Here I present average execution times for the twelve main contenders on JAMIE4K for all list lengths, and for the two biggest list sizes, I also include their results on KESÄMETSÄ and SOMMERSKOV. To see the full results for all 35 methods on each machine with average, minimum, maximum and median execution times, standard deviation and variation coefficient, there is one link each for JAMIE4K, KESÄMETSÄ and SOMMERSKOV. The files are grouped by list size, data type and operation. Within each group, sorting is by average execution time. At the end of the article there is a brief description of all methods.

This section has a lot of numbers – too many numbers, the reader may feel after a while. I am fairly brief in explaining the reasons for results that may be unexpected, and instead defer those to the section Special Observations. This is because the explanations are often more relevant than the numbers, as they may reveal things that I was not able to capture in the tests.

All numbers are derived from SQL 2005 SP1, unless I say otherwise.

All timings are in milliseconds.

List with 20 Elements

Method # runs Str_UNPACK Int_UNPACK Str_JOIN Int_JOIN
ITER$CHUNK 100 8 7 10 9
CLR$ITER 100 3 2 5 5
XMLATTR 100 4 3 5 4
TBLNUM$IL 100 6 4 18 13
TBLNUM 100 6 5 8 7
FIX$SINGLE 100 1 2 7 4
FIX$MULTI 100 13 11 17 13
FIX$MSTMT 100 5 4 7 6
CTE$IL 100 5 6 9 8
EXEC$A 100     33 13
MANYSELECT 100 14 9 17 12
REALSLOW 12     66 712 7 515

For this list size all methods are very fast, but one: REALSLOW, which is just that: really slow. It does a lot better on integer lists, since in this case it is possible to force a binary collation. But it's still far behind everything else.

For the other methods, the timings are so low that one should be very careful about drawing any conclusions from the numbers above. Dynamic SQL lags behind a bit, and we can note that FIX$MULTI has two-digit numbers for all operations, and so does TBLNUM$IL when joining. You may recall from the main article that in all these six cases, there is a parallel query plan. While the accuracy is poor, it's still an indication that for short lists, parallelism buys you an overhead.

List with 200 Elements

From this list length I include the ranking for each group. You see this as a grey italic number to the right in each cell.

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 58 9 59 10 81 9 67 10
CLR$ITER 100 19 2 19 2 40 2 27 2
XMLATTR 100 26 4 26 4 48 4 35 5
TBLNUM$IL 100 42 7 33 7 76 8 54 8
TBLNUM 100 28 6 26 4 49 5 34 4
FIX$SINGLE 100 13 1 13 1 37 1 22 1
FIX$MULTI 100 23 3 23 3 47 3 33 3
FIX$MSTMT 100 27 5 26 4 49 5 35 5
CTE$IL 100 44 8 43 8 67 7 53 7
EXEC$A 100     203 11 112 11
MANYSELECT 100 93 10 58 9 115 10 66 9
REALSLOW 2     663 518 12 21 538 12

The field is starting to spread out and we can identify a group of faster methods consisting of the CLR, XML, TBLNUM and the three fixed-length methods of which FIX$SINGLE is leading the pack. There is also a group of slower methods with the CTE, TBLNUM$IL, the iterative method, MANYSELECT and dynamic SQL, roughly in that order.

REALSLOW is getting completely out of hand.

List with 650 Elements

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 185 9 180 10 253 9 217 10
CLR$ITER 100 67 3 62 3 133 3 90 3
XMLATTR 100 88 6 85 6 159 6 116 6
TBLNUM$IL 100 156 8 123 7 238 8 147 7
TBLNUM 100 85 5 80 4 153 5 106 4
FIX$SINGLE 100 47 1 46 1 119 2 75 2
FIX$MULTI 100 58 2 57 2 118 1 69 1
FIX$MSTMT 100 84 4 83 5 151 4 107 5
CTE$IL 100 155 7 153 8 226 7 181 8
EXEC$A 100     634 11 369 11
MANYSELECT 100 300 10 179 9 377 10 207 9

REALSLOW is not in the list any more, but it was taken off the racing course by the test script after the runs for 200 list elements.

We can start to discern an order in the fast group as well, with the two inline functions for fixed-length taking the lead, ahead of the CLR, while the other three in this group are still fairly close together. Notice that the simple inline function for fixed-length is faster for unpack, but when joining, the complex function is the quickest. The parallel plan is starting to pay off.

In the slower bunch, the order is the same as for 200 list elements with one exception: TBLNUM$IL is now clearly ahead of the CTE for integer operations.

List with 2000 Elements

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 565 9 571 10 775 9 657 10
CLR$ITER 100 205 3 188 3 413 3 276 3
XMLATTR 100 272 6 260 6 499 6 351 6
TBLNUM$IL 100 486 8 378 7 519 7 418 7
TBLNUM 100 257 4 235 4 464 4 318 4
FIX$SINGLE 100 145 1 136 1 364 2 230 2
FIX$MULTI 100 162 2 154 2 333 1 167 1
FIX$MSTMT 100 257 4 247 5 466 5 328 5
CTE$IL 100 480 7 472 8 693 8 567 8
EXEC$A 100     2 130 11 1 304 11
MANYSELECT 100 1 186 10 547 9 1 404 10 635 9

The trend from the previous list size remains: In the top we find the two fixed-length inline functions, with FIX$SINGLE somewhat faster for unpack. But FIX$MULTI is now distinctively faster for the join operation over the integer column. The CLR is very clearly number three. Of the other three faster methods, TBLNUM is faster for the integer operation, whereas the multi-statement fixed-length function wins on the strings. XML is slightly behind these two.

In the slow group TBLNUM$IL is distancing itself even more from the CTE, and for the join operations it is now closer to XML. Dynamic SQL is lagging more and more behind, more or less in a group of its own now.

Up to 650 list elements, the execution times on the other two test machines are too low to be accurate. But with 2 000 list elements, it's worth looking at these as well. Here are the results from KESÄMETSÄ, the other 32-bit machine:

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 125 9 130 10 190 8 151 10
CLR$ITER 100 42 3 41 3 104 3 62 3
XMLATTR 100 59 5 59 6 130 6 82 6
TBLNUM$IL 100 117 8 91 7 205 9 128 7
TBLNUM 100 59 5 54 5 122 5 73 4
FIX$SINGLE 100 30 1 29 1 95 1 51 1
FIX$MULTI 100 38 2 36 2 99 2 59 2
FIX$MSTMT 100 55 4 53 4 119 4 74 5
CTE$IL 100 107 7 107 8 171 7 131 8
EXEC$A 100     677 11 431 11
MANYSELECT 100 234 10 126 9 300 10 146 9

The outcome is similar to the results on JAMIE4K. There is one significant difference: for the join operations, FIX$SINGLE is faster than FIX$MULTI on KESÄMETSÄ, despite the complex function winning big on JAMIE4K. This is not surprising, given that KESÄMETSÄ with its single hyper-threaded CPU is not a good venue for parallel plans. Another token of this is that TBLNUM$IL is even slower than the iterative method in one case – recall that when joining, TBLNUM$IL gets a parallel plan. If you look closer into the numbers, you can sense that the CLR is not too distant from fixed-length, and that XML still has contact with TBLNUM and FIX$MSTMT.

Let's look at the 64-bit machine SOMMERSKOV:

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 81 9 81 10 113 9 93 10
CLR$ITER 100 23 3 21 3 58 3 33 3
XMLATTR 100 38 6 35 6 70 6 45 6
TBLNUM$IL 100 61 7 47 7 73 7 55 7
TBLNUM 100 32 5 30 4 65 5 39 4
FIX$SINGLE 100 18 1 16 1 49 2 32 2
FIX$MULTI 100 20 2 18 2 48 1 27 1
FIX$MSTMT 100 31 4 30 4 63 4 41 5
CTE$IL 100 68 8 68 8 99 8 83 8
EXEC$A 100     499 11 333 11
MANYSELECT 100 136 10 76 9 168 10 87 9

On this machine FIX$MULTI manages to win the join operations, if not as decisively as on JAMIE4K. (But there are only two CPUs here.) Otherwise, the observations for the fast group are the same as on the other machines.

In the slow group, the CTE, the iterative method and MANYSELECT are very close to each other for the integer operations, unlike JAMIE4K where there were distinct differences between them. For most operations, TBLNUM$IL is closer to XML than to the CTE, so it's a bit unfair to put it into the slow group here.

List with 10000 Elements

First the numbers for JAMIE4K:

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 42 2 761 9 2 778 9 3 608 9 3 183 9
CLR$ITER 100 973 3 899 3 1 785 4 1 268 4
XMLATTR 100 1 312 6 1 253 6 2 229 7 1 720 7
TBLNUM$IL 93 2 384 8 1 850 7 1 469 2 977 2
TBLNUM 100 1 154 5 1 036 4 1 992 5 1 440 5
FIX$SINGLE 100 667 1 629 1 1 531 3 1 109 3
FIX$MULTI 100 701 2 661 2 1 399 1  788 1
FIX$MSTMT 100 1 152 4 1 102 5 1 997 6 1 519 6
CTE$IL 54 2 347 7 2 312 8 3 216 8 2 782 8
EXEC$A 16     13 787 11 11 271 11
MANYSELECT 17 6 471 10 2 843 10 7 343 10 4 665 10

Up from the dark! The big sensation is inline_split_me, which on the join operations suddenly appears as #2, beaten only by the complex fixed-length function. That is, the two methods that result in parallel plans are the winners. Recall that TBLNUM$IL only has a parallel plan for joining, and indeed its numbers for the unpack operations are bleak. This is in contrast to fixstring_multi, where the parallelism is there to resolve the cross-join of the Numbers table, so it has parallelism also for the unpack.

Apart from TBLNUM$IL's rise to the top, the main pattern from 2 000 elements remains. FIX$MULTI is still fastest for joining, and FIX$SINGLE is now at #3, clearly ahead of the CLR, which in its turn is ahead of TBLNUM and FIX$MSTMT that still stay close, particularly for the string operations. XML is quite distinctively behind these two now.

In the slower bunch, there are two observations to make. One is that dynamic SQL gets out of hand. The string length increased by a factor of 5, but the execution time for dynamic SQL increased a lot more than that. The other is that MANYSELECT now loses big against the iterative method on the integer join. Recall that the two methods were close on both integer operations for previous list lengths, as they still are for the unpack. Why this happens I will return to in the section Inline, Multi-statement and Temp Tables.

Let's look at KESÄMETSÄ:

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 617 9 642 9 871 8 740 9
CLR$ITER 100 208 3 198 3 464 3 294 3
XMLATTR 100 289 6 286 6 578 6 406 6
TBLNUM$IL 100 573 8 436 7 977 9 633 7
TBLNUM 100 264 5 234 4 516 5 334 4
FIX$SINGLE 100 139 1 132 1 390 1 245 1
FIX$MULTI 100 165 2 156 2 415 2 264 2
FIX$MSTMT 100 247 4 237 5 499 4 337 5
CTE$IL 100 527 7 522 8 785 7 637 8
EXEC$A 66     6 085 11 5 180 11
MANYSELECT 100 1 615 10 659 10 1 882 10 1 254 10

Compared to 2 000 elements, there are no drastic differences, but compared to JAMIE4K, there are. That is, on this machine, inline_split_me is only penalised for its parallel plan. And similarly, FIX$MULTI is behind FIX$SINGLE, although not by much. Otherwise the results are very similar to those on JAMIE4K. CLR is clearly at #3, and TBLNUM and FIX$MSTMT are close, even if with a slightly different distribution. XML falls behind the latter two, more than for 2 000 elements.

In the slower group, dynamic SQL has an increase in execution time which is far more than five-fold. And, as on JAMIE4K, MANYSELECT has a poor result for the integer join.

Here are the results for the 64-bit SOMMERSKOV, and it's time for new surprises:

Method # runs Str_UNPACK Int_UNPACK        Str_JOIN       Int_JOIN
ITER$CHUNK 100 399 9 406 10 520 9 459 9
CLR$ITER 100 109 1 99 3 230 1 142 2
XMLATTR 100 178 3 169 6 287 4 208 7
TBLNUM$IL 100 300 7 230 7 281 3 192 6
TBLNUM 100 147 2 132 5 273 2 177 4
FIX$SINGLE 100 179 4 71 1 299 6 151 3
FIX$MULTI 100 192 5 83 2 297 5 105 1
FIX$MSTMT 100 236 6 130 4 365 7 179 5
CTE$IL 100 332 8 330 8 440 8 410 8
EXEC$A 100     4 267 11 3 518 11
MANYSELECT 100 904 10 391 9 1 035 10 701 10

For the string operations, CLR is now the undisputed winner, TBLNUM comes in second, and for the join operation, TBLNUM$IL is just behind, enjoying two real CPUs for its parallel plan. Next is XML. Or to put it differently, the fixed-length method runs into trouble and suffers a drastic drop in performance. I discuss this at length in the section Fixed-Length and the String-Length Threshold.

For the integer operations, things are more familiar. The complex fixed-length function wins the integer join. A small surprise is that the CLR sneaks in as #2, but the field is tight here. On this machine too, XML falls behind TBLNUM, although it manages to keep contact for one of the join operations.

The observations for dynamic SQL and MANYSELECT are the same as on the other two machines.

Conclusion

There are some funny observations that call for explanation, and I will return to these in the section on Special Observations.

However, I think that the most important observation one can make from these results is that up to a list length of 2 000 elements, all methods – save REALSLOW – are as such good enough. But 2 000 elements is also about the upper limit where dynamic SQL is defensible from the point of view of performance. At 10 000 elements, the other methods still perform sub-second on KESÄMETSÄ and SOMMERSKOV, although MANYSELECT does this only for integers. (And, yes, only for the unpack, but it has an excuse for the join over the integers that I will return to.) The results on JAMIE4K are certainly a lot more gloomy, but with its 550 MHz CPUs it is in no way representative of what you can expect from a production machine today.

Thus, as long as you don't expect your list-to-table function to run in a very busy environment, you can pick the method which you feel most comfortable with. ...with the important disclaimer that there is no way to tell from these tests how the methods will interact with your query.

Keep in mind that I ran these tests on idle machines and from a single process. So if you expect your list-to-table function to run on a machine where everything else is happening at the same time, or if you expect multiple clients to hammer the server with lists, then there is all reason to stick to the faster methods: fixed-length, table-of-numbers, the CLR or XML. And you should probably avoid the methods that get their speed through parallel plans (after all, they are consuming more resources). If you have the need for absolute top performance, there is all reason to conduct your own tests.

The same goes if you need to handle even longer lists than I've run my tests over. As you can sense, there are some thresholds. See how at 10 000 elements, inline_split_me rose to glory with 4 CPUs, and how fixed-length collapsed on 64-bit. Surely, there are more surprises that I for one reason or another did not capture in my tests.

Special Observations

In this section I go into deeper detail on the test results and look at methods that are variations of the main contenders to highlight important observations, and I will try to explain why some things happen the way they do.

To Chunk or Not to Chunk

For several methods, I presented a version that sliced up the input into chunks, and said this would give better performance. You have already been able to compare TBLNUM$IL and TBLNUM in the main results, but I will include those numbers again to highlight the effect. I will also look at the iterative method, introducing ITER$SIMPLE, which is akin to iter$simple_intlist_to_tbl in the beginning of the main article.

Since this is about getting the data out from the list, I focus on the numbers for the unpack operation. I only present data for JAMIE4K as the result is the same on all machines.

List length Str_UNPACK   Int_UNPACK
ITER$CHUNK ITER$SIMPLE Ratio   ITER$CHUNK ITER$SIMPLE Ratio
20 8 8 1.00   7 9 1.29
200 58 65 1.12   59 74 1.25
650 185 221 1.19   188 257 1.37
2 000 565 676 1.20   571 788 1.38
10 000 2 761 3 324 1.20   2 778 3 885 1.40
  TBLNUM TBLNUM$IL Ratio   TBLNUM TBLNUM$IL Ratio
20 6 6 1.00   5 4 0.80
200 28 42 1.50   26 33 1.27
650 85 156 1.84   80 123 1.54
2 000 257 486 1.89   235 378 1.61
10 000 1 154 2 384 2.07   1 036 1 850 1.79

You can clearly see that the chunked version is always faster than the simpler non-chunked version, with the exception of the shortest list length, where timings are not very accurate. You can also see that the ratio between the non-chunked version and the chunked version increases as the list size increases, and this is particularly pronounced for the table-of-numbers method. Yes, for the join operation, the development goes in the completely opposite direction for the table-of-numbers method on JAMIE4K, due to parallelism, but here I wanted to highlight the costs of working with nvarchar(MAX).

You can see that the gain from chunking is much bigger with the table-of-numbers method than with the iterative method. Both call charindex when they have found a delimiter, but the table-of-numbers method also calls substring for every character.

For the iterative method, the penalty for the simple method is far higher for a list of numbers than for a list of strings. How come? The version of iter$simple_intlist_to_table that appears in the article uses comma as delimiter, but ITER$SIMPLE calls a version with space as a delimiter, just like I used for iter_intlist_to_table, and when it finds a space it searches for the next one. For a list with a single space between each number, this would not be an issue, but the test script generates a random spacing of one to four characters between the list elements. Thus, ITER$SIMPLE has to make more calls to charindex for the integer lists.

The tests also included a chunked version of the CTE method, CTE$CHUNK. I show only the numbers for unpack of strings, as the results for unpack of integers are very similar. Instead I add the results from SOMMERSKOV for an interesting twist:

List length JAMIE4K   SOMMERSKOV
CTE$CHUNK CTE$IL Ratio   CTE$CHUNK CTE$IL Ratio
20 5 7 1.40   0 1 0.50
200 44 52 1.18   6 7 1.17
650 155 168 1.08   22 21 0.95
2 000 480 508 1.06   68 64 0.94
10 000 2 347 2 481 1.06   332 316 0.95

It may come as a surprise that the chunked version is actually somewhat slower than the non-chunked version on SOMMERSKOV, but this is a trick of the eye. Keep in mind that chunking calls for a multi-statement function, and apparently the cost of bouncing the data over a table variable is somewhat higher than the cost of charindex on nvarchar(MAX) on SOMMERSKOV, whereas on JAMIE4K the relations are the other way around. I have not investigated what this may be due to, but my bet is that it's mere chance. In fact, in an earlier round of tests, the results were the opposite: inline was faster on JAMIE4K and chunking was faster on SOMMERSKOV!

Had I instead presented the numbers for CTE$MSTMT (a non-chunked multi-statement function) vs. CTE$CHUNK, you would have seen ratios similar to those for the iterative method, which is to be expected, since the CTE method too calls substring only when it has found a list element.

Inline, Multi-statement and Temp Tables

In the main results you could compare the inline FIX$SINGLE to a multi-statement function with the same contents as fixstring_single, used by FIX$MSTMT. Let's also throw in a method where the output from fixstring_single is bounced over a temp table, FIX$TMPTBL. I will also look at a similar triplet for XML. In XMLATTR, I use nodes directly in the query; this is what I called opaque inline in the beginning of the main article. In XMLATTR$TVAR I bounce the data over a table variable, and in XMLATTR$TMPTBL over a temp table.

Here are the numbers for these six methods on JAMIE4K:

Method Str_JOIN Int_JOIN
20 200 650 2 000 10 000 20 200 650 2 000 10 000
FIX$SINGLE 7 37 119 364 1 531   4 22 75 230 1 109
FIX$MSTMT 7 49 151 466 1 997   6 35 107 328 1 519
FIX$TMPTBL 5 42 138 437 1 899   5 30 95 289 2 797
XMLATTR 5 48 159 499 2 229   4 35 116 351 1 720
XMLATTR$TVAR 7 54 176 553 2 513   7 41 135 412 2 016
XMLATTR$TMPTBL 8 55 178 555 2 515   5 40 134 414 3 405

The inline methods, FIX$SINGLE and XMLATTR, are always the fastest, which is "the way it should be". (But in the world of an optimizer working with estimates, things are not always that way.) When comparing temp tables to table variables, there is, for me at least, an unexpected difference. For XML, the numbers for the temp-table and table-variable methods are very similar (except for the join over integers for 10 000 elements, which I will return to shortly). On the other hand, FIX$TMPTBL is clearly ahead of FIX$MSTMT (save that join over 10 000 integers). I don't have any final answer for why it is so. The query plans are identical, so it seems that bouncing the data over a table variable through a table-valued UDF is more expensive than bouncing over a local table variable or a temp table. (I find it difficult to believe that XML vs. fixed-length would make the difference.)

The most interesting thing in this table is what happens to the methods with temp tables in the join over 10 000 integers, where execution times are suddenly a lot higher than for FIX$MSTMT and XMLATTR$TVAR. The answer lies in the query plan. The "normal" query plan for the integer join is to scan the source table created from the list of values, and then perform a nested-loops join against usrdictwords, the test table. This is also the plan you get when the optimizer has no particular information. But a temp table has statistics, so the optimizer gets a second chance and recompiles the SELECT statement once the temp table has been filled, and for 10 000 elements the plan consists of a hash join with a single scan of the test table.

As you can see from the numbers, this was not the best plan. So, is it a bad idea to use a temp table? Not really. Assume that I had run a test with 50 000 or 100 000 list elements – what would have happened then? With the methods XMLATTR$TVAR and FIX$MSTMT, the optimizer will never have any information about what is in that table variable, so it would still have gone for the nested-loops join. And at some size, the loop join would indeed be more expensive than the table scan and the hash join.

And this is not really any different for the inline methods. Looking at the estimates in the query plan for FIX$SINGLE makes it very obvious that the optimizer has no clue of what is going on. Whatever the length of the input string, the optimizer estimates that we will read 300 000 rows from Numbers – this is a default assumption of 30 % of the rows. Thankfully, for some reason, the optimizer estimates that the read against the test table will only hit one row. (Had it assumed 30 %, it would not have picked a nested-loops join and the result for short strings would have been embarrassing.) So no matter how long the input string is, FIX$SINGLE will always use the same query plan. The same applies to XMLATTR – even if it has a completely different default estimate.
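
To make the difference concrete, here is a minimal sketch (not the actual test code) of the two ways of bouncing the unpacked list before the join. The row source below is just a stand-in; in the tests it was of course the output of the split function:

-- Temp table: it has statistics, so the joining SELECT can be recompiled once
-- the table has been filled, and for a large row count the optimizer may
-- switch to a hash join with a scan of usrdictwords.
CREATE TABLE #nums (number int NOT NULL)

INSERT #nums (number)
SELECT Number FROM Numbers WHERE Number <= 10000    -- stand-in for the unpacked list

SELECT u.wordno, u.guid
FROM   usrdictwords u
JOIN   #nums n ON u.wordno = n.number

DROP TABLE #nums

-- Table variable: no statistics and no recompile, so the optimizer keeps its
-- blind-guess nested-loops plan no matter how many rows the list holds.
DECLARE @nums TABLE (number int NOT NULL)

INSERT @nums (number)
SELECT Number FROM Numbers WHERE Number <= 10000    -- stand-in for the unpacked list

SELECT u.wordno, u.guid
FROM   usrdictwords u
JOIN   @nums n ON u.wordno = n.number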

Before I close this section, I would like to point out that we saw a token of this in the main results, when MANYSELECT suddenly fell behind the iterative method when joining over 10 000 integers. MANYSELECT is the only method of the main contenders to use a temp table.

The Collation Issue

All through the body of the main article, I had this COLLATE Slovenian_BIN2 wherever I was looking for a delimiter. I figured that I should give you an idea of how much effect this gives. So in the test suite, I included TBLNUM$NOBIN, which calls a version of inline_split_me without the COLLATE clause. I show the results for the unpack of strings on JAMIE4K and SOMMERSKOV:

List length JAMIE4K   SOMMERSKOV
TBLNUM$IL TBLNUM$NOBIN Ratio   TBLNUM$IL TBLNUM$NOBIN Ratio
20 6 7 1.17   0 1  
200 42 54 1.29   6 8 1.33
650 156 191 1.22   19 25 1.32
2 000 486 600 1.23   61 80 1.31
10 000 2 384 2 969 1.25   300 396 1.32

As you see, the gains of the COLLATE clause are very obvious, even if it is not a dramatic difference. I picked the table-of-numbers method for this comparison, since no other method makes so many comparisons against the delimiter as this method. (For other methods the gain is smaller. I seem to recall that for the iterative method, the improvement was around 10 %. Then again, look at the numbers for REALSLOW to see a really drastic difference.)
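
As a reminder of what the trick looks like, here is a minimal sketch (assuming, as in the tests, a database with the Windows collation Slovenian_CS_AS):

DECLARE @str nvarchar(MAX)
SELECT @str = N'alpha,  beta, gamma'

-- Delimiter search under the database collation; the comparison follows the
-- full linguistic rules of the collation:
SELECT charindex(N',', @str)

-- The same search with a binary collation forced, as in inline_split_me; the
-- comparison is a plain code-point comparison and is considerably cheaper:
SELECT charindex(N',' COLLATE Slovenian_BIN2, @str)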

Fixed-Length and the String-Length Threshold

A very puzzling observation was when fixed length was overtaken by the CLR, TBLNUM and XML on SOMMERSKOV for 10 000 elements, but only for strings. Why only for strings and why only on one machine?

Why it only happens for strings is not so difficult to understand, and neither is it difficult to see why just fixed-length is the victim. The input strings are longer for fixed-length than for any other method, and in my tests the fixed element length was 9 for integers and 30 for strings, yielding a total length of 300 000 characters in the latter case. Thus, it seems reasonable to assume that we have hit a threshold in the handling of the nvarchar(MAX) data type. Exactly in what sense, I don't know, but since an nvarchar(MAX) variable can hold 2 GB of data, it would not be surprising if SQL Server handled it differently depending on size.

That leaves us with the question, why does this threshold occur only on one machine? No, it is not an issue with the query plan this time. The plans and the estimates on SOMMERSKOV and the other two machines are identical. Another question is of course: what happens if we increase the string length even further? Will execution times rise even more steeply, or will it flatten out again?

To investigate this, I composed a test that performed the unpack operations for all lengths from 60 000 characters (which corresponds to 2 000 list elements in the main test) up to 900 000 characters in steps of 60 000:

CREATE TABLE #fix (len  int NOT NULL,
                   ms   int NOT NULL)

DECLARE @len int,
        @sql nvarchar(MAX),
        @delta int

SELECT @delta = 60000
SELECT @len = @delta
WHILE @len <= 900000
BEGIN
  SELECT @sql = 'DECLARE @tookms int, @str nvarchar(MAX)
                 SELECT @str = replicate(convert(nvarchar(MAX), ''X''), ' +
                 ltrim(str(@len))  + ')
                 EXEC FIX$SINGLE_Str_UNPACK_test @str, @tookms OUTPUT
                 SELECT len(@str), @tookms'
 --PRINT @sql
  INSERT #fix(len, ms)
    EXEC(@sql)

  SELECT @len = @len + @delta
END

SELECT f1.len, f1.ms,
       msperc = convert(decimal(8, 2), 1E2*f1.ms/f2.ms - 100),
       lenperc = convert(decimal(8, 2), 1E2*f1.len/f2.len - 100)
FROM   #fix f1
LEFT   JOIN   #fix f2 ON f2.len = f1.len - @delta
ORDER BY f1.len
go
DROP TABLE #fix

FIX$SINGLE_Str_UNPACK_test is the procedure that I used to test the unpack of strings for FIX$SINGLE in the main test (with the final SELECT removed). The use of dynamic SQL is a bit clunky, but I wanted to avoid using the same nvarchar(MAX) variable again, in case that would distort the results.

What the script produces is the execution time for each length, and how much the execution time and the list length increased in percent from the previous length. Here is the output from all three test machines:

Length JAMIE4K KESÄMETSÄ SOMMERSKOV
No of chars % Chg Time % Chg Time % Chg Time % Chg
60 000   156   30   16  
120 000 100 340 118 76 153 46 188
180 000 50 466 37 93 22 46 0
240 000 33 593 27 110 18 63 37
300 000 25 733 24 123 12 186 195
360 000 20 843 15 156 27 220 18
420 000 17 1 923 128 360 131 266 21
480 000 14 2 186 14 403 12 280 5
540 000 13 2 453 12 466 16 326 16
600 000 11 2 703 10 610 31 360 10
660 000 10 3 093 14 673 10 406 13
720 000 9 3 263 6 800 19 420 3
780 000 8 3 530 8 670 -16 470 12
840 000 8 4 030 14 766 14 500 6
900 000 7 4 030 0 843 10 530 6

We can make two immediate observations:

  1. The threshold that appears at 300 000 characters on SOMMERSKOV appears at 420 000 characters on JAMIE4K and KESÄMETSÄ.
  2. They are indeed threshold points. After these jumps, the slope is more normal.

(There are some other minor irregularities in these numbers. Please pay no attention to them. These numbers are not averages over 100 runs; each length was run only a single time.)

There is one more observation, that was not equally immediate to me, but when SQL Server MVP Roy Harvey saw my script, he ran it a few times on his machine, and plotted this graph in Excel:

While the slope evens out after the jump, the slope is still steeper than before the jump.

I have run this script on a couple more machines, and the threshold appeared at 420 000 characters on all computers but one, where it occurred at 300 000 characters. And all machines were 32-bit boxes – except that last one, which was an x64 machine. Thus, it seems to depend on the processor type where this threshold appears. As for why it would be different, I don't know. And even less do I have an idea why the threshold would be lower on x64! (Could it be something else, like memory? It doesn't seem so. One of the 32-bit machines that I ran this on was a virtual computer running VMware, with only 256 MB of memory. The threshold was still at 420 000 characters.)

Here I'd like to insert a request: I don't have access to an IA64 machine. If you have, I would be interested in seeing the results. Follow this link to get a complete version of the script that creates a database, the Numbers table, the stored procedure and the function.

As I discussed in the main article, this problem is not confined to the fixed-length method; all methods that use T-SQL inline functions are subject to this phenomenon, whereas methods that apply chunking largely evade the problem, and the CLR and XML methods have their own handling of long strings outside the realm of nvarchar(MAX). (This statement was also confirmed in my new suite of tests in 2009, where I tested lists with 50 000 elements.) As for the methods that use dynamic SQL, including MANYSELECT, I have not examined this, but it does not appear very interesting anyway.

Nevertheless, as long as the input elements vary greatly in length, the fixed-length method is the first that will hit this limit, since its input strings get longer. If you want to use the fixed-length method (e.g. because your input has fixed length), you could implement a chunking version of the fixed-length method, but with bigger chunks than 4 000. Or you could implement the fixed-length method in the CLR, as I did in the CLR$FIX method, see further below.

Two final notes: 1) In this section I've talked about characters. The actual limit is of course in bytes, so if you use varchar the thresholds are at 600 000 and 840 000 characters. 2) And of course the limits are somewhere in the spans 480 000 – 600 000 bytes and 720 000 – 840 000 bytes respectively. I did not find it interesting to track down the exact threshold points. Let's say that they are at 500 000 and 750 000 bytes.

FIX$BINARY

The main body of the text included a version of fixed-length that accepted a binary string. The idea was that for integers, this would be a faster method. Let's see how that worked out. I show the numbers for the unpack operation on all three machines, only for 2 000 and 10 000 list elements.

List length JAMIE4K   KESÄMETSÄ   SOMMERSKOV
  FIX$BINARY $SINGLE   FIX$BINARY $SINGLE   FIX$BINARY $SINGLE
2 000 134 136 28 29 15 16
10 000 613 629 125 132 70 71

The results are not conclusive, you can't say any method is better than the other. (However, in the tests in 2009, FIX$BINARY did actually prove to be faster.)

In the main article, I discussed that using binary input could save network bandwidth, but keep in mind that my tests did not cover that.

(One can note that if you use binary input, you can get quite a few more elements into the list before you hit that threshold at 500 000 – 750 000 bytes that I discussed in the previous section.)
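
To illustrate the idea, here is a hedged sketch of how a binary list of 4-byte integers can be unpacked with the table of numbers (the actual function behind FIX$BINARY may look different):

DECLARE @bin varbinary(MAX)
SELECT @bin = 0x000000090000000C0000001B     -- the list 9, 12, 27

SELECT convert(int, substring(@bin, 4 * (Number - 1) + 1, 4)) AS number
FROM   Numbers
WHERE  Number <= datalength(@bin) / 4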

CLR

I tested three different CLR methods: CLR$SPLIT, which uses the Split function in the .Net Framework, CLR$ITER, where I do the iteration myself, and CLR$FIX, which works with fixed-length input. I show the results from all three machines, but only for the unpack operations for 2 000 and 10 000 list elements:

List length Server Str_UNPACK   Int_UNPACK
  CLR$SPLIT $ITER $FIX   CLR$SPLIT $ITER $FIX
2 000 JAMIE4K 206 205 210 191 188 189
KESÄMETSÄ 44 42 43 42 41 41
SOMMERSKOV 23 23 24 19 21 19
10 000 JAMIE4K 993 973 1 010 912 899 903
KESÄMETSÄ 213 208 215 203 198 197
SOMMERSKOV 110 109 115 97 99 95

As you see, the differences are nowhere close to dramatic. Rather they are utterly boring. CLR$ITER appears to be slightly faster than CLR$SPLIT, but the results are anything but conclusive.

The most interesting thing about these numbers is what does not happen with CLR$FIX: 1) It is not faster than the other methods – apparently the penalty for searching for delimiters is much smaller in the CLR than in T-SQL. 2) CLR$FIX does not fall behind the other methods for 10 000 elements on SOMMERSKOV. That is, we can see that the threshold we saw for fixed-length in T-SQL does not apply to the CLR. (The input to CLR$FIX is the same as to FIX$SINGLE & co.)

XML

Performance-testing XML presented me with a problem. Normally, my test procedures receive an input parameter of the type nvarchar(MAX), and for all methods but XML that's uncontroversial. For XML this would be somewhat strange, because when you use XML in practice, you would of course accept a parameter of the new xml data type. Since this type is more complex, there is likely to be some overhead in receiving xml parameters. (Note that this type stores the XML document in a binary format, so full parsing is not required.) If I had started and stopped the timer in the test script around the call to the stored procedure, this would not have been an issue. But since I did not want client-server communication to distort the results, I start and stop the timer in the test procedures. This left me with two options: accept an xml parameter, so that any overhead for receiving it falls outside what I measure, or accept an nvarchar(MAX) parameter and convert it to xml inside the procedure, so that the conversion cost is included in the measurement.

The only reasonable choice was to do both, to get the upper and lower boundaries for the actual cost for XML. The XMLATTR that you have already seen the results for is the version that accepts an xml parameter. In this section, I will compare it to XMLATTR$STR, which is the same as XMLATTR except that the input is an nvarchar(MAX) parameter.
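
Here is a hedged sketch of the two alternatives (the procedure names, the XML shape and the attribute name are mine, not those of the actual test procedures):

-- Accept the xml type directly; any cost for receiving the parameter falls
-- outside the timer inside the procedure. This is what XMLATTR measures.
CREATE PROCEDURE xml_param_sketch @doc xml AS
   SELECT T.Item.value('@num', 'int') AS number
   FROM   @doc.nodes('/Root/Num') AS T(Item)
go

-- Accept nvarchar(MAX) and convert inside the procedure; the conversion cost
-- is then part of the measurement. This is what XMLATTR$STR measures.
CREATE PROCEDURE str_param_sketch @str nvarchar(MAX) AS
   DECLARE @doc xml
   SELECT @doc = convert(xml, @str)
   SELECT T.Item.value('@num', 'int') AS number
   FROM   @doc.nodes('/Root/Num') AS T(Item)
go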

While I'm at it, I will also look at other methods using XML: XMLELEM uses nodes just like XMLATTR, but with element-centred XML. (XMLELEM takes an xml input parameter.) XMLATTR$OPEN and XMLELEM$OPEN use the old OPENXML function rather than nodes. They accept nvarchar(MAX) as input, because this is how you would use OPENXML. For reference, I also include the numbers for TBLNUM$IL, "fastest of the slowest".

Here are the numbers for JAMIE4K and SOMMERSKOV for the unpack of integers; the numbers for a list of strings add nothing remarkable.

  JAMIE4K   SOMMERSKOV
Method 20 200 650 2 000 10 000   20 200 650 2 000 10 000
XMLATTR 3 26 85 260 1253   1 4 11 35 169
XMLATTR$STR 4 32 105 324 1580   2 5 17 48 239
TBLNUM$IL 4 33 123 378 1850   1 3 14 47 230
XMLELEM 3 38 126 388 1898   1 6 18 55 272
XMLATTR$OPEN 8 50 229 802 4075   1 5 28 87 439
XMLELEM$OPEN 8 54 237 826 4198   1 6 28 91 458

We can see that the overhead for assigning the input string to an xml variable inside the procedure is by no means negligible. The overhead when we convert the XML document from a string is around 25 % on JAMIE4K and as much as 40 % on SOMMERSKOV. I did not include the numbers from KESÄMETSÄ, but the overhead is around 25 % there as well, so it may be that there is a difference between 32-bit and x64 here.

But we can also see that if we put XMLATTR$STR in place of XMLATTR, it affects the ranking of the main results only marginally. XMLATTR ranked last of the fast methods in most cases, and with the string overhead, it still keeps a distance to TBLNUM$IL on JAMIE4K (and KESÄMETSÄ). On SOMMERSKOV the game is tighter. Assuming that the true numbers lie somewhere in between, it still seems appropriate to classify XML as a fast method.

The numbers for XMLELEM tell us that element-centred XML is 50-60 % slower than attribute-centred XML. It may be surprising that the difference is that big, and it is no less surprising, to me at least, that this is due to differences in the query plans. But the really big – and nasty – surprise was what happened when I ran the tests on an SP2 instance of SQL 2005 on SOMMERSKOV. For most methods this had no significant effect on the results. Except for XMLELEM:

List length Str_UNPACK Int_UNPACK Str_JOIN Int_JOIN
20 3 3 3 4
200 110 118 116 125
650 984 1 029 1 007 1 060
2 000 9 164 9 525 9 139 9 585
10 000 229 641 240 101 230 359 239 953

The results are nothing but a complete disaster. What it is all about is really a serious misjudgement by the optimizer. Already on SP1, the query plan for XMLELEM is more complex than for XMLATTR. (Since XQuery processing is something I'm not well versed in, I won't go into details.) But on SP2, the optimizer adds an Eager Spool operator. Often when you see an Eager Spool operator in your query plan, you know you are in trouble.

Thankfully, this does not always happen. The test procedure inserts the resulting data from the tested statement into a table, since I don't want client communication to distort the results. When I removed the INSERT, the Eager Spool did not appear, and performance was normal. I also found that adding a clustered index on the table prevented the Eager Spool from appearing. That is, the problem appears only if you insert into a heap. It appears that this issue has been addressed in SQL 2005 SP3. As for SQL 2008, please see the main article.
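
That is, for the target tables in these tests the workaround would be as simple as this, here shown for the table used by the integer unpack test (the index name is mine):

CREATE CLUSTERED INDEX number_ix ON Int_UNPACK(number)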

Finally, we can see that using OPENXML does not perform very well. While it qualifies as "good enough", it's difficult to see why you should use OPENXML, as it does not buy you anything that nodes does not. Unless, that is, you run into that optimizer bug in SP2, and you are already set on using XML, for instance because your aim is to insert many values.

Notes added in 2010: There were several issues in this test:

Both varchar and nvarchar in the Return Table

Some of the functions in the main article have return tables that include both a varchar and an nvarchar column. In the test suite, I included both TBLNUM, which returns only an nvarchar column, and TBLNUM$DUO, which uses duo_chunk_split_me from the main article and returns both varchar and nvarchar. This table shows the results for the unpack of strings on all three servers for list lengths 650 and up:

Server 650 2 000 10 000
TBLNUM $DUO Ratio TBLNUM $DUO Ratio TBLNUM $DUO Ratio
JAMIE4K 85 94 1.11 257 283 1.10 1 154 1 270 1.10
KESÄMETSÄ 18 20 1.11 59 65 1.10 264 296 1.12
SOMMERSKOV 11 12 1.09 32 38 1.19 147 167 1.14

TBLNUM$DUO comes with an overhead of 10-15 %. So there is a good argument for using a function that returns only one column. But keep in mind that if you work with both varchar and nvarchar, you need a function for both and you need to remember to use the right function when you join with an indexed string column, as I discussed in the section nvarchar vs. varchar in the main article.

The penalty in percent for dual return tables is likely to be lower for slower methods like the CTE and the iterative method.
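
To spell out what "the right function" means, here is a hedged sketch (the demo table is mine, and the column names str and nstr, as well as the parameter list of duo_chunk_split_me, are assumptions on my part):

CREATE TABLE demo_varchar (keycol int NOT NULL PRIMARY KEY,
                           vword  varchar(50) NOT NULL)
CREATE INDEX vword_ix ON demo_varchar(vword)
go
DECLARE @list nvarchar(MAX)
SELECT @list = N'alpha,beta,gamma'

-- Join against the varchar column of the split function: the index on vword
-- can be used directly.
SELECT d.keycol
FROM   demo_varchar d
JOIN   duo_chunk_split_me(@list, DEFAULT) a ON d.vword = a.str

-- Join against the nvarchar column: vword is implicitly converted to nvarchar,
-- which makes the index less useful (with an SQL collation you get a scan,
-- with a Windows collation a less direct seek).
SELECT d.keycol
FROM   demo_varchar d
JOIN   duo_chunk_split_me(@list, DEFAULT) a ON d.vword = a.nstr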

Variations of inline_split_me

The discovery that the convert of len(@param) to int had such a dramatic effect on inline_split_me was one I made fairly late when working on this article. Here is a comparison between the regular TBLNUM$IL and TBLNUM$NOCNV, which uses a version of inline_split_me without the convert, and consequently does not use a parallel plan. Here is data from all three machines, for the integer operations:

Server List length Int_UNPACK Int_JOIN
TBLNUM$IL $NOCNV Ratio TBLNUM$IL $NOCNV Ratio
JAMIE4K 20 4 4 1.00 13 6 0.46
200 33 34 1.03 54 44 0.81
650 123 125 1.02 147 157 1.07
2 000 378 385 1.02 418 490 1.17
10 000 1 850 1 894 1.02 977 2 408 2.46
KESÄMETSÄ 20 1 1 1.00 3 1 0.33
200 8 8 1.00 11 9 0.82
650 26 28 1.08 39 34 0.87
2 000 91 91 1.00 128 114 0.89
10 000 436 440 1.01 633 565 0.89
SOMMERSKOV 20 1 0   2 1 0.50
200 3 4 1.33 7 6 0.86
650 14 16 1.14 19 21 1.11
2 000 47 48 1.02 55 69 1.25
10 000 230 238 1.03 192 328 1.71

You can see that for the pure unpack, the effect of the type cast is minuscule. On the other hand, for the join, the effect is dramatic, with enormous gains for longer lists on JAMIE4K and SOMMERSKOV. But it cuts both ways. On KESÄMETSÄ, the version without the type cast is always faster, because parallelism backfires here. And while accuracy is poor, there is an indication that for short list lengths, parallelism only causes an overhead without returns.

inline_split_me lends itself to another interesting observation. The version that appears in the article was suggested to me by Brian W Perrin, who had read my article on arrays and lists for SQL 2000 and found a modification that he thought would improve performance. For reference, here are both versions:

CREATE FUNCTION inline_split_me(@param nvarchar(MAX)) RETURNS TABLE AS
RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000),
                  substring(@param, Number,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      @param + N',', Number) -
                            Number)
              ))) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(@param))
         AND  substring(N',' + @param, Number, 1) =
                        N',' COLLATE Slovenian_BIN2)
go

CREATE FUNCTION inline_split_me_old(@param nvarchar(MAX)) RETURNS TABLE AS
RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000),
                  substring(N',' + @param + N',', Number + 1,
                            charindex(N',' COLLATE Slovenian_BIN2,
                                      N',' + @param + N',', Number + 1) -
                            Number - 1)
              ))) AS Value
       FROM   Numbers
       WHERE  Number <= convert(int, len(N',' + @param + N',')) - 1
         AND  substring(N',' + @param + N',', Number, 1) =
                        N',' COLLATE Slovenian_BIN2)
go

As you see, the old version adds commas before and after the input parameter everywhere it is referenced. It certainly looks less efficient, but how much does it really matter? Here are the results for unpack on JAMIE4K:

List length |           Str_UNPACK            |           Int_UNPACK
            | TBLNUM$IL  TBLNUM$ILOLD  Ratio  | TBLNUM$IL  TBLNUM$ILOLD  Ratio
------------+---------------------------------+--------------------------------
        20  |         6             6   1.00  |         4             5   1.25
       200  |        42            43   1.02  |        33            35   1.06
       650  |       156           159   1.02  |       123           126   1.02
     2 000  |       486           499   1.03  |       378           389   1.03
    10 000  |     2 384         3 890   1.63  |     1 850         1 895   1.02

In all cases but one, the difference is so small that it can be disputed whether it is significant at all. But then there is one case where Brian's version really gives an enormous performance boost. Or rather, it is the other way round: at 10 000 string elements, the old version apparently hits a threshold and execution time goes through the roof. What sort of threshold this might be I have not investigated, nor have I investigated whether Brian's version also hits a threshold further up. But the important observation is that there are more threshold issues with the MAX data types than just the one we saw for fixed length.

You may wonder what happens when we join, and get the parallel plan. For the fun of it, here are the numbers for KESÄMETSÄ (where the parallel plan is not really good):

List length |            Str_JOIN             |            Int_JOIN
            | TBLNUM$IL  TBLNUM$ILOLD  Ratio  | TBLNUM$IL  TBLNUM$ILOLD  Ratio
------------+---------------------------------+--------------------------------
        20  |         4             4   1.00  |         3             2   0.67
       200  |        17            18   1.06  |        11            11   1.00
       650  |        62            64   1.03  |        39            40   1.03
     2 000  |       205           208   1.01  |       128           131   1.02
    10 000  |       977         1 022   1.05  |       633           658   1.04

The threshold goes away more or less entirely. The ratios for 10 000 string elements are somewhat higher on SOMMERSKOV and JAMIE4K, but it's a far cry from what happens without parallelism. No, I have no idea why.
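For reference, both variations of the function are called in exactly the same way; here is a minimal usage sketch (the orders table and its columns are hypothetical):

-- Pure unpack: return the list elements as a table.
SELECT Value FROM inline_split_me(N'Alice,Bob,Cesar')

-- Join: the more typical use, here against a hypothetical orders table
-- keyed on an integer custid.
DECLARE @custids nvarchar(MAX)
SELECT @custids = N'1,15,293,2024'

SELECT o.orderid, o.custid
FROM   orders o
JOIN   inline_split_me(@custids) AS l ON o.custid = convert(int, l.Value)
go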

Inserting Many Values

I would like to give you a quick look at the methods for inserting many values. The main results included MANYSELECT, which transforms the input string into an INSERT from an EXEC() of many SELECT statements. Let's also look at MANYUNION, where the string is transformed into one big INSERT SELECT-UNION statement, and MANYINSERT, where the transformation gives many INSERT VALUES statements. These are the results for the unpack operation on SOMMERSKOV:

Method      |             Str_UNPACK              |            Int_UNPACK
            |  20   200    650   2 000   10 000   |  20   200   650  2 000   10 000
------------+--------------------------------------+----------------------------------
MANYSELECT  |   3    15     44     136      904    |   2     9    24     76      391
MANYUNION   |   5    99    885  11 533  371 813    |   3    39   312  4 797  176 834
MANYINSERT  |  15   173    531   1 566    7 631    |  14   152   463  1 319    6 601

As you can see, both MANYINSERT and MANYUNION trail far behind the more complex MANYSELECT. But it is nevertheless interesting to compare MANYINSERT and MANYUNION. For the shorter list lengths, UNION performs better than INSERT, but UNION scales dreadfully. When the list length grows tenfold, from 200 to 2 000, the execution time for UNION grows a hundredfold!

The answer surely lies in the optimizer. For the optimizer, the long SELECT UNION statement is very complex, even if it consists only of constants. For shorter list lengths, UNION manages to beat INSERT, because INSERT has the overhead of writing one row at a time. But as the input size increases, the optimisation overhead swallows everything that was gained by avoiding the row-by-row inserts.
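To make the distinction between the three methods concrete, here is a sketch of what the generated SQL could look like for the short integer list 1,2,3. The real test procedures build these strings from the input parameter and run them with EXEC(), and the details may differ; the temp table #nums is a placeholder:

-- MANYINSERT: one INSERT VALUES statement per list element.
INSERT #nums (n) VALUES (1)
INSERT #nums (n) VALUES (2)
INSERT #nums (n) VALUES (3)

-- MANYUNION: a single INSERT over one big SELECT ... UNION of constants.
INSERT #nums (n)
   SELECT 1 UNION SELECT 2 UNION SELECT 3

-- MANYSELECT: INSERT ... EXEC() over a batch of many small SELECT statements.
INSERT #nums (n)
   EXEC('SELECT 1 SELECT 2 SELECT 3')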

There is one more thing to comment on, one we already saw in the main results. For all these methods, the execution time for unpacking a list of strings is considerably higher than for integers. I would assume that this is due to parsing and query processing: when unpacking strings there are more characters and also more tokens (the single quotes) to process.

Dynamic SQL

The test suite includes two sets of test procedures for dynamic SQL, EXEC$A and EXEC$B, but apart from the name, they are identical. EXEC$B is simply EXEC$A run a second time with exactly the same input. This means that we can expect EXEC$B to find a plan for the dynamic SQL query in the cache, which permits us to use EXEC$B to determine how much of the time for EXEC$A is compilation. This is the data from SOMMERSKOV. For reference, I also include the numbers for FIX$SINGLE:

Method      |           Str_JOIN              |           Int_JOIN
            |  20   200   650  2 000  10 000  |  20   200   650  2 000  10 000
------------+---------------------------------+---------------------------------
EXEC$A      |   5    42   139    499   4 267  |   4    23    86    333   3 518
EXEC$B      |   1     3    12     42     156  |   0     1     5     19     415
FIX$SINGLE  |   1     4    15     49     299  |   0     3     8     32     151

We find that EXEC$B is faster than anything else we have looked at in this article. And of course it should be: there is no overhead for a table of numbers, no need to pass the nvarchar(MAX) to the CLR and search the string; the plan just looks up the values. But how often do you run the same list of values twice? EXEC$B cannot count as a real method. And it is very obvious that the cost of dynamic SQL lies in the compilation.

There is one more observation to make: for the "join" over 10 000 integer values, EXEC$B does not perform equally well; it is about on par with CTE$IL. Looking at the values for the individual runs, most of them are in the range 340-375 ms, and then there are two runs exceeding 3 500 ms. That is, occasionally the plan from EXEC$A is for some reason not around in the cache when EXEC$B runs.

But that only explains why the average is 415 ms and not 355 ms, which is still far behind FIX$SINGLE. The answer lies in the query plan. The optimizer builds an internal table of constants, akin to a temp table, and recall what happened with a temp table when joining over the integer column with 10 000 elements: there was a table scan + hash match, and that is what happens in this case as well. In the main article, I said that dynamic SQL could be a last resort if you have problems getting good performance with short lists, because dynamic SQL is the only way to give the optimizer full knowledge of the values. But you can see here why I made the qualification that the lists should be short: for 10 000 elements, the optimizer thinks for three seconds, and then comes up with a non-optimal plan.
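For reference, here is a minimal sketch of the dynamic SQL technique as such; the orders table, its columns and the procedure name are hypothetical, and the actual test procedures differ in details:

CREATE PROCEDURE exec_sketch @custids nvarchar(MAX) AS
   -- The list is inlined directly into the query text, so every new list
   -- gives a new query string that must be compiled from scratch.
   DECLARE @sql nvarchar(MAX)
   SELECT @sql = N'SELECT orderid, custid FROM orders ' +
                 N'WHERE custid IN (' + @custids + N')'
   EXEC (@sql)
   -- Note that inlining the string like this exposes you to SQL injection
   -- unless the input is validated.
go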

What happens when the setting forced parameterisation is in effect? I ran a separate test on SOMMERSKOV to find out, and to make it quick, I only had ten runs per method. That is quite enough to show the effect, though:

Method      |           Str_JOIN              |           Int_JOIN
            |  20   200   650  2 000  10 000  |  20   200   650  2 000  10 000
------------+---------------------------------+---------------------------------
EXEC$A      |   1    12    34     72   4 142  |   1     4    16     56   3 399
EXEC$B      |   1     6    23     41     155  |   1     1    10     37     347
FIX$SINGLE  |   4     4    17     52     296  |   2     2    12     32     151

That was indeed the "go faster" switch. But you can see that EXEC$A does not beat FIX$SINGLE even with auto-parameterisation. For some reason the rerun in EXEC$B is still considerably faster; why, I don't know. (Since other methods run before EXEC$A with the same input, the data is already in cache, so that is not the reason.)

More importantly, the gain disappears completely for the longest lists. Since a batch or stored procedure in SQL Server cannot take more than 2 100 parameters, parameterisation does not happen when the input list has more elements than that.
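For reference, forced parameterisation is a database-level setting, enabled and disabled with ALTER DATABASE along these lines (the database name is a placeholder):

-- Turn on forced parameterisation for the test database.
ALTER DATABASE testdb SET PARAMETERIZATION FORCED
go
-- And back to the default once the test is done.
ALTER DATABASE testdb SET PARAMETERIZATION SIMPLE
go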

The full results from the quick test with forced parameterisation included a few more methods, and you can view the results here.

All Methods in the Test

Here is a list of all methods in the test, with a brief description. Click each method name to see the test procedures, and from there you can get to the tested functions.

Again, here are the links for the full results for all machines: JAMIE4K, KESÄMETSÄ and SOMMERSKOV.

CLR$FIX – Fixed-length format, implemented with the CLR.
CLR$ITER – Rolling our own in the CLR.
CLR$SPLIT – CLR using the split method.
CTE$CHUNK – A chunked procedure using a recursive CTE.
CTE$IL – cte_split_inline in the main article.
CTE$MSTMT – A multi-statement function with the same contents as cte_split_inline.
EXEC$A – Dynamic SQL.
EXEC$B – Dynamic SQL run a second time.
FIX$BINARY – fixbinary_single.
FIX$ITER – The iterative method with fixed-length input.
FIX$MSTMT – A multi-statement version of fixstring_single.
FIX$MULTI – fixstring_multi.
FIX$MULTI2 – fixstring_multi2.
FIX$SINGLE – fixstring_single.
FIX$TMPTBL – fixstring_single with data bounced over a temp table.
ITER$CHUNK – The iterative method with chunks.
ITER$SIMPLE – The iterative method without chunking.
MANYINSERT – Making the list into many INSERT statements.
MANYSELECT – Making the list into many SELECT statements.
MANYUNION – Making the list into many SELECT UNION.
REALSLOW – Using charindex.
SLOW$LIKE – Using LIKE.
TBLNUM – chunk_split_me, same as duo_chunk_split_me but with a single return column.
TBLNUM$DUO – duo_chunk_split_me.
TBLNUM$IL – inline_split_me.
TBLNUM$ILOLD – An older version of inline_split_me.
TBLNUM$NOBIN – A version of inline_split_me without COLLATE clauses.
TBLNUM$NOCNV – inline_split_me where len(@param) is not cast to int.
XMLATTR – Attribute-centred XML with nodes and input as xml.
XMLATTR$OPEN – Attribute-centred XML with OPENXML.
XMLATTR$STR – Attribute-centred XML with nodes, and input as nvarchar(MAX).
XMLATTR$TMPTBL – Same as XMLATTR, but results bounced over a temp table.
XMLATTR$TVAR – Same as XMLATTR, but results bounced over a table variable.
XMLELEM – Element-centred XML with nodes and input as xml.
XMLELEM$OPEN – Element-centred XML with OPENXML.

Revisions

2010-10-10 – Updated the section on XML, since the tests on XML had some flaws.

2010-01-06 – There is now a second appendix with data from a new suite of tests on SQL 2008, to test new methods.

2008-12-20 – Added note that the issue with element-centred XML does not seem to appear in SQL 2005 SP3 and SQL 2008.

2007-03-03 – Initial version.

Back to my home page.