An SQL text by Erland Sommarskog, SQL Server MVP Latest revision 2010-10-10. Copyright applies to this text.
This is an appendix to the article Arrays and Lists in SQL Server 2005 and Beyond. In this appendix, I present data and analyses from the of performance tests I ran in 2006 for the various list-to-table methods. If you have arrived at this page from a web search, you may want to at least browse the main article first. There is a second appendix with tests that I ran in 2009 on SQL 2008, which covers many of the methods tested here as well, as well new methods that I added to the article after complete the original test suite, including table-value parameters, a new featyre in SQL 2008. The two appendixes are differently structured, and you need to read both to get a complete pictures. There are observations that I make only in this appendix and vice versa.
Table of Contents:
General Disclaimers
The Contenders
How the Tests Were Carried Out
The Test Table
The Test Data
Tested Operations
Sample Test Procedures
The Test Script
Used Hardware
The Results
List with 20 Elements
List with 200 Elements
List with 650 Elements
List with 2000 Elements
List with 10000 Elements
Conclusion
Special Observations
To Chunk or Not to Chunk
Inline, Multi-statement and Temp Tables
The Collation Issue
Fixed-Length and the String-Length Threshold
FIX$BINARY
CLR
XML
Both varchar and nvarchar in the Return Table
Variations of inline_split_me
Inserting Many Values
Dynamic SQL
All Methods in the Test
Revisions
I decided to keep these tests fairly simple. This means:
This means that these while these tests give us some information, there are also pieces missing. The numbers should be informative enough to tell us how fast or slow the methods as such are. But they do not tell us how they will work with your tables, and nor do they say much about how they will work under load.
Thus, if you need top performance, you may need to conduct your own tests with your tables. Maybe you need to include the client-side time it takes to compose XML documents and the extra network cost for XML or fixed-length if you are considering these methods. Maybe you need to conduct testing with many parallel processes, so that you see what happens when SQL Server comes under load. And real-world application tables are not likely to fit into cache.
I like to stress that you cannot use the results in this article and the old article for SQL 2000 to compare SQL 2005 to SQL 2000. There are important differences in the test setup that makes any such comparison impossible in general. The only exception is for dynamic SQL where the difference is drastic for large input.
Note: for the SQL 2000 tests, I made the entire test kit available in a zip file, but I have opted not to so this time. The test database is much larger, and a zip file of it all is almost 50 MB. All stored procedures and the test script are available through links in this appendix. Please contact me, if you also want the database.
As I have assembled material for this article, I have tested far too many methods to include data for all of them in the main body of this text. All methods that I have tested are derived from the methods I discuss in the main article. Some methods are important modifications to the main theme, whereas others are just small variations, like the same algorithm both in a inline function and a multi-statement function. For many of these small variations I first ran a quick test, and if that quick test did not reveal anything significant, I did not save that data. Eventually, 35 "methods" made it to the finals, run on three different machines.
From this, I have selected twelve methods for which I will present the results in detail. In the section Special Observations, I highlight various things of interest, and most other methods appear here. The full data for all 35 methods on all three test machines are available in text files. There are links to these files further down in this text.
I have given each method that I've tested a name, and I will use these names as a convenient short-hand notation, mixed with the name of the functions as they appeared in the main article. Throughout the appendix, these names appear as links and if you click the link, you will be brought to page with the test procedures, which in their turn have links to tested functions. Thus, in this way you can see exactly what I have tested.
Here are the twelve main competitors:
ITER$CHUNK | The iterative functions iter_intlist_to_tbl and iter_charlist_to_tbl. |
CLR$ITER | The CLR functions CLR_intlist_iter and CLR_charlist_iter. |
XMLATTR | Using nodes and value with attribute-centred XML. |
TBLNUM$IL | A table with numbers: inline_split_me. |
TBLNUM | A table of numbers: the function text_split_me which is the same as duo_text_split_me in the main article, save that it only returns an nvarchar column. |
FIX$SINGLE | Fixed-length list elements: the function fixstring_single, joining once with the table of numbers. |
FIX$MULTI | Fixed-length list elements: fixstring_multi, that joins the table of numbers twice to make it virtually unlimited. |
FIX$MSTMT | Fixed-length list elements: a multi-statement version of fixstring_single. |
CTE$IL | Recursive CTE, the function cte_split_inline. |
EXEC$A | Dynamic SQL. |
MANYSELECT | Making the list into a SELECT, the procedures unpack_with_manyselect and unpackstr_with_manyselect. |
REALSLOW | A really slow method, using charindex to find the list elements. |
The full list of all tested methods is at the end of this article.
This is the table I have used for all tests.
CREATE TABLE usrdictwords (wordno int NOT NULL, word nvarchar(50) NOT NULL, guid char(36) NOT NULL) CREATE UNIQUE CLUSTERED INDEX wordno_ix ON usrdictwords(wordno) CREATE UNIQUE INDEX word_ix ON usrdictwords(word)
The table contains 1 163 750 rows, and the average length of the column word is 10.6 characters. The longest word is 30 characters. wordno holds numbers in the range from 0 to 1 163 749. The order of wordno is not correlated with the order of word. The column guid serves to make the table a little wider, and to be a token non-indexed column. The data in guid is not used for anything in the tests.
The collation for the database was Slovenian_CS_AS. (The data is a taken from a Slovenian version of the Unix file /usr/dict/words, used for the spell utility, that I culled off the net somewhere.)
Note: the test table for SQL 2000 had the same schema, except that I used varchar in those tests, The volume in the SQL 2000 was a mere 200 000 words, despite the source that time too was a spelling dictionary for Slovenian. Somehow, Slovenian has expanded its vocabulary with almost a factor six in just a few years!
For each test, my test script randomly selected rows from usrdictwords, to construct one comma-separated list of strings and one list of integers from the columns word and wordno respectively. After each list item, the script randomly added 0-3 spaces between the list elements. (The reason for this is that the test script also served to validate that the methods yielded the correct result.) For fixed-length and XML the list was transformed to the respective formats. For the really slow methods, the random spacing was removed, as they are not able to cope with it.
I ran the test with five different list sizes: 20, 200, 650, 2 000 and 10 000 elements in the list. The basic strategy was to run each method 100 times. However, to avoid that slow methods kept the script running for days, the script applied a cut-off strategy, so that when a method had used a total of 15 minutes, it was not permitted to run more on that list size. It was still given the chance to run two runs on the next list size, as long it did not exceed the next limit of 75 minutes.
I have tested two types of operations:
UNPACK | Unpacking the input list itself into a result set, without involving another table. This test is not possible to carry out with dynamic SQL and the really slow methods. |
JOIN | Using the input list to extract data from the test table. With most methods, this means a join operation, and I denote this operation as JOIN also for a method like dynamic SQL even if there is no actual join taking place. |
Here is a sample of a typical UNPACK procedure:
CREATE PROCEDURE TBLNUM_Int_UNPACK_test @str nvarchar(MAX), @tookms int OUTPUT AS DECLARE @start datetime SELECT @start = getdate() INSERT Int_UNPACK(number) SELECT number = convert(int, str) FROM chunk_split_me(@str, DEFAULT) SELECT @tookms = datediff(ms, @start, getdate()); SELECT number FROM Int_UNPACK TRUNCATE TABLE Int_UNPACK
The procedure: 1) Accepts an nvarchar(MAX) parameter. 2) Starts a timer. 3) Issues the query, inserting the result set into a permanent, empty, table. 4) Stops the timer. 5) Returns the data to the test script, which checks that the procedure returned the expected data. 6) Truncates the table after the run.
I insert data into a table, since had I returned data to the client, the measured values would have been distorted by random factors not related to the tested methods. (You may wonder why I used INSERT and not SELECT INTO, which I did in the tests for SQL 2000. I had SELECT INTO, but I found that got random values that were several hundred ms higher than most other values for a certain test. I suspected that the reason was auto-stats on the system tables, which you cannot turn off. And, indeed, switching to INSERT lead to timings being more consistent.)
The test procedures for XML accept a parameter of the xml data type. If there is a larger overhead for receiving an xml parameter, this is not captured in the test. I discuss this problem in further detail in the section on XML under Special Observations.
I also give an example of a JOIN procedure to highlight one important detail with the string join:
CREATE PROCEDURE CLR$ITER_Str_JOIN_test @str nvarchar(MAX), @tookms int OUTPUT AS DECLARE @start datetime SELECT @start = getdate() INSERT Str_JOIN(wordno, guid) SELECT u.wordno, u.guid FROM usrdictwords u JOIN CLR_charlist_iter(@str, DEFAULT) AS a on u.word = a.str SELECT @tookms = datediff(ms, @start, getdate()); SELECT wordno FROM Str_JOIN TRUNCATE TABLE Str_JOIN
Note here that when joining over the string column in the test table, I am also retrieving the column guid which is not in any index. This forces SQL Server to access the data pages of the table. (Otherwise the query would be covered by the index on word.) This gives the optimizer the two choices: using the non-clustered index on word with bookmark lookups or scanning the table.
For most methods, there have been four procedures: Str_UNPACK, Int_UNPACK, Str_JOIN and Int_JOIN. You can retrieve the test procedures for a certain method by clicking on the method name as it appears in the article.
The test script is written in Perl, connecting to SQL Server using the SQL Native Client OLE DB provider.
The script creates the database with a size of 195 MB for the data file and 100 MB for the log file (the initial size is 750 MB during the data load, but the script shrinks it before running the tests). The intention was that the test procedures should be run with simple recovery, but they were in fact run with bulk-logged recovery. tempdb is not controlled by the script, but I made sure that tempdb had at least 50 MB for both data and log.
The test script runs with the default SET options you get with OLE DB, except that it also issues SET NOCOUNT ON.
When running the tests, the test script creates lists of strings and integers as described above, and then passes the assembled lists to all test procedures currently in the test database in alphabetic order. That is, all methods see the same lists. The script first runs all tests for lists with 20 elements, then all for 200 elements etc. When starting with a new list length, the script performs an sp_recompile on the usrdictwords table to flush all plans. It then runs a test 0 for that length, which is not included in the results, so that the cost for compilation does not cause distortions.
The test script uses a fixed seed for the random generator, so different runs of the script generate the same test data. This permitted to me test a method separately from the rest, yet using exactly the same data as for the others.
The script starts a transaction before running a test procedure and commits it after the run.
The script is able to call the test procedures through RPC or by sending an EXEC commands, but in the final test runs I ran all tests with RPC calls. (The choice of calling method should reasonably not have any importance, but during the SQL 2000 tests, fixed-length methods behaved very differently depending of the call method for some list sizes. In my preliminary tests I was not able to repeat this for SQL 2005.)
You can view the complete test script here.
I ran the tests on three different machines:
JAMIE4K | This is a server with 4 Pentium III, 550 GHz, 3 GB of RAM, disk arrays. Windows 2000 SP4. This server was also used for the SQL 2000 tests. |
KESÄMETSÄ | This is my workstation at home, with a HT Pentium 4 at 2.8 GHz, 1.5 GB of RAM. Windows XP SP2. IDE disks. (The same machine appeared in the SQL 2000 tests, but I have replaced most of the hardware have since then.) |
SOMMERSKOV | Another machine at home with an AMD Athlon X2 4200, 2.2 GHz, dual core. 2 GB of RAM. Windows XP x64 SP1. SATA disks. |
When running the tests for KESÄMETSÄ and SOMMERSKOV, I ran the test scripts locally. When running the tests for JAMIE4K, I ran the script from my workstation at work. (JAMIE4K is one of our development servers.)
In the main presentation of the results in this article, I only show the results from JAMIE4K for the shorter lists length. I show data for KESÄMETSÄ and SOMMERSKOV for the tests with 2 000 and 10 000 elements. There are several reasons for this. One is that of the test machines, JAMIE4K is the only one that resembles a production server. (In fact, it has a history in the dim and distant past as a production box.) Its CPU specs may seem humble, but that is actually an advantage here. SQL Server has a resolution of 3.33 ms for datetime, and in my experience it's impossible when measuring with getdate to get a result greater than zero, but lower than 13 ms. On SOMMERSKOV and KESÄMETSÄ, there are plentiful of runs on 0 ms, even for 650 elements, whereas JAMIE4K only has a handful for 200 elements, and none at 650 elements.
I should also note that on KESÄMETSÄ, I ran the tests with hyper-threading enabled, and with SQL Server configured to have both logical CPUs available. Normally, you disable hyper-threading when you run SQL Server on hyper-threaded CPUs, or at least you set the configuration option max degree of parallelism to the number of physical CPUs (or CPU cores), since parallel plans do not benefit from hyper-threading. (As I understand it, hyper-threading is derived by using free cycles in the CPU while waiting for data to come in from memory. But with parallel queries, data comes in all the time, and there are no free cycles, so the threads intended to be parallel are instead serialised.)
Here I present average execution times for the twelve main contenders on JAMIE4K for all lists length, and for the two biggest list sizes, I also include their results on KESÄMETSÄ and SOMMERSKOV. To see the full results for all 35 methods for each machine with average, minimum, maximum and median execution times, standard deviation and variation coefficient, here is one link each for JAMIE4K, KESÄMETSÄ and SOMMERSKOV The files are grouped by list size, data type and operation. Within each group, sorting is by average execution time. At the end of the article there is a brief description of all methods.
This section has a lot of numbers, too many numbers the reader may feel after a while. I am fairly brief in explaining the reason for results that may be unexpected, and instead defer those to the section Special Observations. This is because the explanations are often more relevant than the numbers as they may reveal things that I was not able to capture in the tests.
All numbers are derived from SQL 2005 SP1, unless I say otherwise.
All timings are in milliseconds.
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN |
---|---|---|---|---|---|
ITER$CHUNK | 100 | 8 | 7 | 10 | 9 |
CLR$ITER | 100 | 3 | 2 | 5 | 5 |
XMLATTR | 100 | 4 | 3 | 5 | 4 |
TBLNUM$IL | 100 | 6 | 4 | 18 | 13 |
TBLNUM | 100 | 6 | 5 | 8 | 7 |
FIX$SINGLE | 100 | 1 | 2 | 7 | 4 |
FIX$MULTI | 100 | 13 | 11 | 17 | 13 |
FIX$MSTMT | 100 | 5 | 4 | 7 | 6 |
CTE$IL | 100 | 5 | 6 | 9 | 8 |
EXEC$A | 100 | 33 | 13 | ||
MANYSELECT | 100 | 14 | 9 | 17 | 12 |
REALSLOW | 12 | 66 712 | 7 515 |
For this list size all methods are very fast, but one: REALSLOW, which is just that: really slow. It does a lot better on integer lists, since in this case it is possible to force a binary collation. But it's still far behind everything else.
For the other methods, the timings are so low, that one should be very careful to draw any conclusions from the numbers above. Dynamic SQL lags behind a bit, and we can note that FIX$MULTI has two-digit numbers for all operations, and so has TBLNUM$IL when joining. You may recall from the main article, that in all these six cases, there is a parallel query plan. While the accuracy is poor, it's still an indication that for short lists, there parallelism buys you an overhead.
From this list length I include the ranking for each group. You see this as a grey italic number to the right in each cell.
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 58 | 9 | 59 | 10 | 81 | 9 | 67 | 10 |
CLR$ITER | 100 | 19 | 2 | 19 | 2 | 40 | 2 | 27 | 2 |
XMLATTR | 100 | 26 | 4 | 26 | 4 | 48 | 4 | 35 | 5 |
TBLNUM$IL | 100 | 42 | 7 | 33 | 7 | 76 | 8 | 54 | 8 |
TBLNUM | 100 | 28 | 6 | 26 | 4 | 49 | 5 | 34 | 4 |
FIX$SINGLE | 100 | 13 | 1 | 13 | 1 | 37 | 1 | 22 | 1 |
FIX$MULTI | 100 | 23 | 3 | 23 | 3 | 47 | 3 | 33 | 3 |
FIX$MSTMT | 100 | 27 | 5 | 26 | 4 | 49 | 5 | 35 | 5 |
CTE$IL | 100 | 44 | 8 | 43 | 8 | 67 | 7 | 53 | 7 |
EXEC$A | 100 | 203 | 11 | 112 | 11 | ||||
MANYSELECT | 100 | 93 | 10 | 58 | 9 | 115 | 10 | 66 | 9 |
REALSLOW | 2 | 663 518 | 12 | 21 538 | 12 |
The field is starting to spread out and we can identify a group of faster methods consisting of the CLR, XML, TBLNUM and the three fixed-length methods of which FIX$SINGLE is leading the pack. There is also a group of slower methods with the CTE, TBLNUM$IL, the iterative method, MANYSELECT and dynamic SQL, roughly in that order.
REALSLOW is getting completely out of hand.
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 185 | 9 | 180 | 10 | 253 | 9 | 217 | 10 |
CLR$ITER | 100 | 67 | 3 | 62 | 3 | 133 | 3 | 90 | 3 |
XMLATTR | 100 | 88 | 6 | 85 | 6 | 159 | 6 | 116 | 6 |
TBLNUM$IL | 100 | 156 | 8 | 123 | 7 | 238 | 8 | 147 | 7 |
TBLNUM | 100 | 85 | 5 | 80 | 4 | 153 | 5 | 106 | 4 |
FIX$SINGLE | 100 | 47 | 1 | 46 | 1 | 119 | 2 | 75 | 2 |
FIX$MULTI | 100 | 58 | 2 | 57 | 2 | 118 | 1 | 69 | 1 |
FIX$MSTMT | 100 | 84 | 4 | 83 | 5 | 151 | 4 | 107 | 5 |
CTE$IL | 100 | 155 | 7 | 153 | 8 | 226 | 7 | 181 | 8 |
EXEC$A | 100 | 634 | 11 | 369 | 11 | ||||
MANYSELECT | 100 | 300 | 10 | 179 | 9 | 377 | 10 | 207 | 9 |
REALSLOW is not in the list any more, but it was taken off the racing course by the test script after the runs for 200 list elements.
We can start to discern an order in the fast group as well, with the two inline functions for fixed-length taking the lead, ahead of the CLR, while the other three in this group that still are fairly close together. Notice that the simple inline function for fixed-length is faster for unpack, but when joining, the complex function is the quickest. The parallel plan is starting to pay off.
In the slower bunch, the order is the same as for 200 list elements with one exception: TBLNUM$IL is now clearly ahead of the CTE for integer operations.
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 565 | 9 | 571 | 10 | 775 | 9 | 657 | 10 |
CLR$ITER | 100 | 205 | 3 | 188 | 3 | 413 | 3 | 276 | 3 |
XMLATTR | 100 | 272 | 6 | 260 | 6 | 499 | 6 | 351 | 6 |
TBLNUM$IL | 100 | 486 | 8 | 378 | 7 | 519 | 7 | 418 | 7 |
TBLNUM | 100 | 257 | 4 | 235 | 4 | 464 | 4 | 318 | 4 |
FIX$SINGLE | 100 | 145 | 1 | 136 | 1 | 364 | 2 | 230 | 2 |
FIX$MULTI | 100 | 162 | 2 | 154 | 2 | 333 | 1 | 167 | 1 |
FIX$MSTMT | 100 | 257 | 4 | 247 | 5 | 466 | 5 | 328 | 5 |
CTE$IL | 100 | 480 | 7 | 472 | 8 | 693 | 8 | 567 | 8 |
EXEC$A | 100 | 2 130 | 11 | 1 304 | 11 | ||||
MANYSELECT | 100 | 1 186 | 10 | 547 | 9 | 1 404 | 10 | 635 | 9 |
The trend from the previous list size remains: In the top we find the two fixed-length inline functions, with FIX$SINGLE somewhat faster for unpack. But FIX$MULTI is now distinctively faster for the join operation over the integer column. The CLR is very clearly number three. Of the other three faster methods, TBLNUM is faster for the integer operation, whereas the multi-statement fixed-length function wins on the strings. XML is slightly behind these two.
In the slow group TBLNUM$IL is distancing itself even more from the CTE and for the join operations it is now closer to XML. Dynamic SQL is lagging more and more behind, more or a less in a group of its own now.
Up to 650 list elements, the execution times on the other two test machines are too low to be accurate. But with 2 000 list elements, it's worth looking at these as well. Here are the results from KESÄMETSÄ, the other 32-bit machine:
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 125 | 9 | 130 | 10 | 190 | 8 | 151 | 10 |
CLR$ITER | 100 | 42 | 3 | 41 | 3 | 104 | 3 | 62 | 3 |
XMLATTR | 100 | 59 | 5 | 59 | 6 | 130 | 6 | 82 | 6 |
TBLNUM$IL | 100 | 117 | 8 | 91 | 7 | 205 | 9 | 128 | 7 |
TBLNUM | 100 | 59 | 5 | 54 | 5 | 122 | 5 | 73 | 4 |
FIX$SINGLE | 100 | 30 | 1 | 29 | 1 | 95 | 1 | 51 | 1 |
FIX$MULTI | 100 | 38 | 2 | 36 | 2 | 99 | 2 | 59 | 2 |
FIX$MSTMT | 100 | 55 | 4 | 53 | 4 | 119 | 4 | 74 | 5 |
CTE$IL | 100 | 107 | 7 | 107 | 8 | 171 | 7 | 131 | 8 |
EXEC$A | 100 | 677 | 11 | 431 | 11 | ||||
MANYSELECT | 100 | 234 | 10 | 126 | 9 | 300 | 10 | 146 | 9 |
The outcome is similar to the results on JAMIE4K. There is one significant difference: for the join operations, FIX$SINGLE is faster than FIX$MULTI on KESÄMETSÄ, despite the complex function winning big on JAMIE4K. This is not surprising, given that KESÄMETSÄ with its single hyper-threaded CPU is not a good venue for parallel plans. Another token of this is that TBLNUM$IL is even slower than the iterative method in one case – recall that when joining TBLNUM$IL is a parallel-planner. If you look closer into the numbers, you can sense that the CLR is not too distant from fixed-length, and that XML still has contact to TBLNUM and FIX$MSTMT.
Let's look at the 64-bit machine SOMMERSKOV:
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 81 | 9 | 81 | 10 | 113 | 9 | 93 | 10 |
CLR$ITER | 100 | 23 | 3 | 21 | 3 | 58 | 3 | 33 | 3 |
XMLATTR | 100 | 38 | 6 | 35 | 6 | 70 | 6 | 45 | 6 |
TBLNUM$IL | 100 | 61 | 7 | 47 | 7 | 73 | 7 | 55 | 7 |
TBLNUM | 100 | 32 | 5 | 30 | 4 | 65 | 5 | 39 | 4 |
FIX$SINGLE | 100 | 18 | 1 | 16 | 1 | 49 | 2 | 32 | 2 |
FIX$MULTI | 100 | 20 | 2 | 18 | 2 | 48 | 1 | 27 | 1 |
FIX$MSTMT | 100 | 31 | 4 | 30 | 4 | 63 | 4 | 41 | 5 |
CTE$IL | 100 | 68 | 8 | 68 | 8 | 99 | 8 | 83 | 8 |
EXEC$A | 100 | 499 | 11 | 333 | 11 | ||||
MANYSELECT | 100 | 136 | 10 | 76 | 9 | 168 | 10 | 87 | 9 |
On this machine FIX$MULTI manages to win the join operations, if not equally decisive as on JAMIE4K. (But there are only two CPU:s here) Else, the observations for the fast group are the same as on the other machines.
In the slow group, the CTE, the iterative method and MANYSELECT are very close to each other for the integer operations, unlike JAMIE4K where there were distinct differences between them. For most operations, TBLNUM$IL is closer to XML than to the CTE, so it's a bit unfair to put it into the slow group here.
First the numbers for JAMIE4K:
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 42 | 2 761 | 9 | 2 778 | 9 | 3 608 | 9 | 3 183 | 9 |
CLR$ITER | 100 | 973 | 3 | 899 | 3 | 1 785 | 4 | 1 268 | 4 |
XMLATTR | 100 | 1 312 | 6 | 1 253 | 6 | 2 229 | 7 | 1 720 | 7 |
TBLNUM$IL | 93 | 2 384 | 8 | 1 850 | 7 | 1 469 | 2 | 977 | 2 |
TBLNUM | 100 | 1 154 | 5 | 1 036 | 4 | 1 992 | 5 | 1 440 | 5 |
FIX$SINGLE | 100 | 667 | 1 | 629 | 1 | 1 531 | 3 | 1 109 | 3 |
FIX$MULTI | 100 | 701 | 2 | 661 | 2 | 1 399 | 1 | 788 | 1 |
FIX$MSTMT | 100 | 1 152 | 4 | 1 102 | 5 | 1 997 | 6 | 1 519 | 6 |
CTE$IL | 54 | 2 347 | 7 | 2 312 | 8 | 3 216 | 8 | 2 782 | 8 |
EXEC$A | 16 | 13 787 | 11 | 11 271 | 11 | ||||
MANYSELECT | 17 | 6 471 | 10 | 2 843 | 10 | 7 343 | 10 | 4 665 | 10 |
Up from the dark! The big sensation is inline_split_me, which on the join operations suddenly appears as #2, beaten only by the complex fixed-length function. That is, the two methods that result in parallel plans are the winners. Recall that TBLNUM$IL only has a parallel plan for joining, and indeed its numbers for the unpack operations are bleak This is in difference to fixstring_multi where the parallelism is there to resolve to the cross-join of the Numbers table, so it has parallelism also for the unpack.
Apart from TBLNUM$IL's raise to the top, the main pattern from 2 000 elements remains. FIX$MULTI is still fastest for joining, and FIX$SINGLE is now at #3, clearly ahead of the CLR which in its turn is ahead of TBLNUM and FIX$MSTMT that still stay close, particularly for the string operations. XML is quite distinctively behind these two now.
In the slower bunch, there are two observations to make. One is that dynamic SQL gets out of hand. The string length increased with a factor 5, but the execution time for dynamic SQL increases a lot more than that. The other is that MANYSELECT now loses big against the iterative method on the integer join. Recall that the two methods were close on both integer operations for previous list lengths, as they still are for the unpack. Why this happens I will return to in the section Inline, Multi-statement and Temp Tables.
Let's look at KESÄMETSÄ:
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 617 | 9 | 642 | 9 | 871 | 8 | 740 | 9 |
CLR$ITER | 100 | 208 | 3 | 198 | 3 | 464 | 3 | 294 | 3 |
XMLATTR | 100 | 289 | 6 | 286 | 6 | 578 | 6 | 406 | 6 |
TBLNUM$IL | 100 | 573 | 8 | 436 | 7 | 977 | 9 | 633 | 7 |
TBLNUM | 100 | 264 | 5 | 234 | 4 | 516 | 5 | 334 | 4 |
FIX$SINGLE | 100 | 139 | 1 | 132 | 1 | 390 | 1 | 245 | 1 |
FIX$MULTI | 100 | 165 | 2 | 156 | 2 | 415 | 2 | 264 | 2 |
FIX$MSTMT | 100 | 247 | 4 | 237 | 5 | 499 | 4 | 337 | 5 |
CTE$IL | 100 | 527 | 7 | 522 | 8 | 785 | 7 | 637 | 8 |
EXEC$A | 66 | 6 085 | 11 | 5 180 | 11 | ||||
MANYSELECT | 100 | 1 615 | 10 | 659 | 10 | 1 882 | 10 | 1 254 | 10 |
Compared to 2 000 elements, there no drastic differences, but compared to JAMIE4K, there is. That is, on this machine, inline_split_me, is only penalised for its parallel plan. And similarly, FIX$MULTI is behind FIX$SINGLE, although not with much. Else the results are very similar to those on JAMIE4K. CLR is clearly at #3, and TBLNUM and FIX$MSTMT are close even if with a slightly different distribution. XML falls behind the latter two, more than for 2 000 elements.
In the slower group, dynamic SQL has an increase in execution time which is far more than five-fold. And, as on JAMIE4K, MANYSELECT has a poor result for the integer join.
Here are the results for the 64-bit SOMMERSKOV, and it's time for new surprises:
Method | # runs | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|---|---|---|
ITER$CHUNK | 100 | 399 | 9 | 406 | 10 | 520 | 9 | 459 | 9 |
CLR$ITER | 100 | 109 | 1 | 99 | 3 | 230 | 1 | 142 | 2 |
XMLATTR | 100 | 178 | 3 | 169 | 6 | 287 | 4 | 208 | 7 |
TBLNUM$IL | 100 | 300 | 7 | 230 | 7 | 281 | 3 | 192 | 6 |
TBLNUM | 100 | 147 | 2 | 132 | 5 | 273 | 2 | 177 | 4 |
FIX$SINGLE | 100 | 179 | 4 | 71 | 1 | 299 | 6 | 151 | 3 |
FIX$MULTI | 100 | 192 | 5 | 83 | 2 | 297 | 5 | 105 | 1 |
FIX$MSTMT | 100 | 236 | 6 | 130 | 4 | 365 | 7 | 179 | 5 |
CTE$IL | 100 | 332 | 8 | 330 | 8 | 440 | 8 | 410 | 8 |
EXEC$A | 100 | 4 267 | 11 | 3 518 | 11 | ||||
MANYSELECT | 100 | 904 | 10 | 391 | 9 | 1 035 | 10 | 701 | 10 |
For the string operations, CLR is now the undisputed winner, TBLNUM comes in second, and for the join operation, TBLNUM$IL is just behind, enjoying two real CPUs for its parallel plan. Next is XML. Or to put it differently, the fixed-length method runs into a trouble and gets a drastic drop in performance. I discuss this at length in the section Fixed-Length and the String-Length Threshold.
For the integer operations, things are more familiar. The complex fixed-length function wins the integer join, A little surprise is that the CLR sneaks in as #2, but the field is tight here. On this machine too, XML falls behind TBLNUM, although it manages to keep contact for one of the join operations.
The observations for dynamic SQL and MANYSELECT are the same as on the other two machines.
There are some funny observations that calls for explanation, and I will return to these in the section on Special Observations.
However, I think that the most important observation one can make from these results is that up to a list length of 2 000 elements all methods - save REALSLOW – as such are good enough. But 2 000 elements is also about the upper limit where dynamic SQL is defendable from the point of view of performance. At 10 000 elements, the other methods still perform sub-second on KESÄMETSÄ and SOMMERSKOV, although MANYSELECT does this only for integers. (And, yes, only for the unpack, but it has an excuse for the join over the integers, that I will return to.) The results on JAMIE4K are certainly a lot more gloomy, but with its 550 MHz CPU:s it is nowhere representative from what you can expect from a production machine today.
Thus, as long as you don't expect your list-to-table function to run in a very busy environment, you can pick the method which you feel most comfortable with. ...with the important disclaimer that there is no way to tell from these tests how the methods will interact with your query.
Keep in mind that I ran these tests on idle machines and from a single process. So if you expect your list-to-table function run on a machine where everything else is happening at the same time, or you if you expect multiple clients to hammer the server with lists, then there is all reason to stick to the faster methods: fixed-length, table-of-numbers, the CLR or XML. And you should probably avoid the methods that get their speed through parallel plans (after all, they are consuming more resources). If you have the need for absolute top performance, there is all reason to conduct your own tests.
The same goes if you need to handle even longer lists, than I've run my tests over. As you can sense there are some thresholds. See how at 10 000 elements, inline_split_me rose to glory with 4 CPU:s and how fixed-length collapsed on 64-bit. Surely, there are more surprises that I for one reason or another did not capture in my tests.
In this section I go into deeper details of the test results, and look at methods that are variations of main contenders to highlight important observations, and I will try to explain why some things happen the way they do.
For several methods, I presented a version that sliced up the input into chunks, and said this would give better performance. You have already been able to compare TBLNUM$IL and TBLNUM in the main results, but will include those numbers again to highlight the effect. I will also look at the iterative method, introducing ITER$SIMPLE, which akin to the iter$simple_intlist_to_tbl in the beginning of the main article.
Since this is about getting the data out from the list, I focus on the numbers for the unpack operation. I only present data for JAMIE4K as the result is the same on all machines.
List length | Str_UNPACK | Int_UNPACK | |||||
---|---|---|---|---|---|---|---|
ITER$CHUNK | ITER$SIMPLE | Ratio | ITER$CHUNK | ITER$SIMPLE | Ratio | ||
20 | 8 | 8 | 1.00 | 7 | 9 | 1.29 | |
200 | 58 | 65 | 1.12 | 59 | 74 | 1.25 | |
650 | 185 | 221 | 1.19 | 188 | 257 | 1.37 | |
2 000 | 565 | 676 | 1.20 | 571 | 788 | 1.38 | |
10 000 | 2 761 | 3 324 | 1.20 | 2 778 | 3 885 | 1.40 | |
TBLNUM | TBLNUM$IL | Ratio | TBLNUM | TBLNUM$IL | Ratio | ||
20 | 6 | 6 | 1.00 | 5 | 4 | 0.80 | |
200 | 28 | 42 | 1.50 | 26 | 33 | 1.27 | |
650 | 85 | 156 | 1.84 | 80 | 123 | 1.54 | |
2 000 | 257 | 486 | 1.89 | 235 | 378 | 1.61 | |
10 000 | 1 154 | 2 384 | 2.07 | 1 036 | 1 850 | 1.79 |
You can clearly see that the chunked version is always faster than the simpler non-chunked version, with exception for the shortest list length where timings are not very accurate. You can also see that the ratio between the non-chunked version and the chunked version increases as the list size increases, and this is particularly prevalent for the table-of-numbers method. Yes, for the join operation, the development goes in the completely opposite direction for the table-of-numbers method on JAMIE4K, due to parallelism, but here I wanted to highlight the costs of working with nvarchar(MAX).
You can see that the gain from chunking is much bigger with the table-of-numbers method than with the iterative method. Both call charindex when they have has found a delimiter, but the table-of-numbers method also calls substring for every character.
For the iterative method, the penalty for the simple method is far higher for a list of numbers than for a list of strings. How come? The version of iter$simple_intlist_to_table that appears in the article uses comma as delimiter, but ITER$SIMPLE calls a version with space as a delimiter, just like I used for iter_intlist_to_table, and when it finds a space it searches for the next one. For a list with a single space between each number, this would not be an issue, but the test script generates a random spacing of one to four characters between the list elements. Thus, ITER$SIMPLE has to make more calls to charindex for the integer lists.
The tests also included a chunked version of the CTE method, CTE$CHUNK. I show only the numbers for unpack of strings, as the results for unpack of integers are very similar. Instead I add the results from SOMMERSKOV for an interesting twist:
List length | JAMIE4K | SOMMERSKOV | |||||
---|---|---|---|---|---|---|---|
CTE$CHUNK | CTE$IL | Ratio | CTE$CHUNK | CTE$IL | Ratio | ||
20 | 5 | 7 | 1.40 | 0 | 1 | 0.50 | |
200 | 44 | 52 | 1.18 | 6 | 7 | 1.17 | |
650 | 155 | 168 | 1.08 | 22 | 21 | 0.95 | |
2 000 | 480 | 508 | 1.06 | 68 | 64 | 0.94 | |
10 000 | 2 347 | 2 481 | 1.06 | 332 | 316 | 0.95 |
It may come as a surprise that the chunked version is actually somewhat slower than the non-chunked version on SOMMERSKOV, but this is a trick of the eye. Keep in mind that chunking calls for a multi-statement function, and apparently the cost for the bouncing the data over a table variable is somewhat higher than the cost for charindex on nvarchar(MAX) on SOMMERSKOV. Whereas on JAMIE4K, the relations are the other way around. I have not investigated what this may be due to, but I place my bets on that it's mere chance. In fact, in an earlier round of tests, the results were the opposite: inline was faster on JAMIE4K and chunking was faster on SOMMERSKOV!
Had I instead presented the numbers for CTE$MSTMT (a non-chunked multi-statement function) vs. CTE$CHUNK, you would have seen ratios similar to those for the iterative method, which is to be expected, since the CTE method too calls substring only when it has found a list element.
In the main results you could compare the inline FIX$SINGLE to a multi-statement function with the same contents as fixstring_single, used by FIX$MSTMT. Let's also throw in a method where the output from fixstring_single is bounced over a temp table, FIX$TMPTBL. I will also look a similar triplet for XML. In XMLATTR, I use nodes directly in the query; this is what I called opaque inline in the beginning of the main article. In XMLATTR$TVAR I bounce the data over a table variable and in XMLATTR$TMPTBL over a temp table.
Here are the numbers for this six methods on JAMIE4K:
Method | Str_JOIN | Int_JOIN | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 200 | 650 | 2 000 | 10 000 | 20 | 200 | 650 | 2 000 | 10 000 | ||
FIX$SINGLE | 7 | 37 | 119 | 364 | 1 531 | 4 | 22 | 75 | 230 | 1 109 | |
FIX$MSTMT | 7 | 49 | 151 | 466 | 1 997 | 6 | 35 | 107 | 328 | 1 519 | |
FIX$TMPTBL | 5 | 42 | 138 | 437 | 1 899 | 5 | 30 | 95 | 289 | 2 797 | |
XMLATTR | 5 | 48 | 159 | 499 | 2 229 | 4 | 35 | 116 | 351 | 1 720 | |
XMLATTR$TVAR | 7 | 54 | 176 | 553 | 2 513 | 7 | 41 | 135 | 412 | 2 016 | |
XMLATTR$TMPTBL | 8 | 55 | 178 | 555 | 2 515 | 5 | 40 | 134 | 414 | 3 405 |
The inline methods, FIX$SINGLE and XMLATTR are always the fastest, which is "the way it should to be". (But in the world of an optimizer working with estimates, things are not always that way.) When comparing temp tables to table variables, there is, for me at least, an unexpected difference. For XML, the numbers for the temp-table and table-variable methods are very similar (except for the join over integers for 10 000 elements, which I will return to shortly). On the other hand, FIX$TMPTBL is clearly ahead of FIX$MSTMT (save that join over 10 000 integers). I don't have any final answer for why it is so. Query plans are identical, so it seems that bouncing the data over a table variable through a table-valued UDF is more expensive than bouncing over a local table variable or a temp table. (I find it difficult to believe that XML vs. fixed-length would make the difference.)
The most interesting in this table is what happens to the methods with temp tables in the join over 10 000 elements, where execution times are suddenly a lot higher than for FIX$MSTMT and XMLATTR$TVAR. The answer for this lies in the query plan. The "normal" query plan for the integer join is to scan the source table created from the list of values, and then perform a nested-loop join against usrdictwords, the test table. This is also the plan you get, when the optimizer has no particular information. But a temp table has statistics, so the optimizer gets a second chance and recompiles the SELECT statement, once the temp table has been filled, and for 10 000 elements, the plan consists of a hash join with single table scan of the test table.
As you can see from the numbers, this was not the best plan. Thus, it is a bad idea to use a temp table? Not really so. Assume that I would have run a test for 50 000 or 100 000 list elements, what would have happened then? With the methods XMLATTR$TVAR and FIX$MSTMT, the optimizer will never have any information about what is in that table variable, so it would still have gone for the nested-loops join. And at some size, the loop join would indeed be more expensive than the table scan and the hash join.
And this is not really any different for the inline methods. Looking the estimates in the query plan for FIX$SINGLE makes it very obvious that the optimizer has no clue of what is going on. Whatever the length of the input string, the optimizer estimates that we will read 300 000 rows from Numbers – this is a default assumption of 30 % of the rows. Thankfully, for some reason, the optimizer estimates that the read against the test table will only hit one row. (Had it assumed 30 %, it would not have picked nested-loops join and the result for short strings would have been embarrassing.) So no matter how long the input string is, FIX$SINGLE will always use the same query plan. The same is applies to XMLATTR – even if it has a completely different default estimate.
Before I close this section, I like to point out that we saw a token of this in the main results, when MANYSELECT suddenly fell behind the iterative method when joining over 10 000 integers. MANYSELECT is the only method of the main contenders to use a temp table.
All through the body of the main article, I had this COLLATE
Slovenian_BIN2
wherever I was looking for a delimiter. I figured that I
should give you an idea of how much effect this gives. So in the test suite, I
included TBLNUM$NOBIN which calls a version of inline_split_me without the
COLLATE clause. I show the results for the unpack of strings on JAMIE4K on
SOMMERSKOV:
List length | JAMIE4K | SOMMERSKOV | |||||
---|---|---|---|---|---|---|---|
TBLNUM$IL | TBLNUM$NOBIN | Ratio | TBLNUM$IL | TBLNUM$NOBIN | Ratio | ||
20 | 6 | 7 | 1.17 | 0 | 1 | ||
200 | 42 | 54 | 1.29 | 6 | 8 | 1.33 | |
650 | 156 | 191 | 1.22 | 19 | 25 | 1.32 | |
2 000 | 486 | 600 | 1.23 | 61 | 80 | 1.31 | |
10 000 | 2 384 | 2 969 | 1.25 | 300 | 396 | 1.32 |
As you see, the gains of the COLLATE clause are very obvious, even if it is not a dramatic difference. I picked the table-of-numbers method for this comparison, since no other method makes so many comparisons against the delimiter as this method. (For other methods the gain is smaller. I seem to recall that for the iterative method, the improvement was around 10 %. Then again, look at the numbers for REALSLOW to see a really drastic difference.)
A very puzzling observation was when fixed length was overtaken by the CLR, TBLNUM and XML on SOMMERSKOV for 10 000 elements, but only for strings. Why only for strings and why only on one machine?
Why it only happens for strings is not so difficult to understand, and neither why just fixed-length is the victim. The length of the input strings is longer for fixed length for any other method, and in my tests the fixed-element length was 9 for integers and 30 for strings, yielding a total length of 300 000 characters in the latter case. Thus, it seems reasonable to assume that we have hit a threshold in the handling of nvarchar(MAX) data type. Exactly in what sense, I don't know, but since an nvarchar(MAX) variable can fit 2 GB of data, it would not be surprising that SQL Server would handle it differently depending on size.
That leaves us with the question, why does this threshold occur only on one machine? No, it is not an issue with the query plan this time. The plans and the estimates on SOMMERSKOV and the other two machines are identical. Another question is of course: what happens if we increase the string length even further? Will execution times rise even more steeply, or will it flatten out again?
To investigate this, I composed a test that performed the unpack operations for all lengths from 60 000 characters (which corresponds to 2 000 list elements in the main test) up to 900 000 characters in steps of 60 000:
CREATE TABLE #fix (len int NOT NULL, ms int NOT NULL) DECLARE @len int, @sql nvarchar(MAX), @delta int SELECT @delta = 60000 SELECT @len = @delta WHILE @len <= 900000 BEGIN SELECT @sql = 'DECLARE @tookms int, @str nvarchar(MAX) SELECT @str = replicate(convert(nvarchar(MAX), ''X''), ' + ltrim(str(@len)) + ') EXEC FIX$SINGLE_Str_UNPACK_test @str, @tookms OUTPUT SELECT len(@str), @tookms' --PRINT @sql INSERT #fix(len, ms) EXEC(@sql) SELECT @len = @len + @delta END SELECT f1.len, f1.ms, msperc = convert(decimal(8, 2), 1E2*f1.ms/f2.ms - 100), lenprec = convert(decimal(8, 2), 1E2*f1.len/f2.len - 100) FROM #fix f1 LEFT JOIN #fix f2 ON f2.len = f1.len - @delta ORDER BY f1.len go DROP TABLE #fix
FIX$SINGLE_Str_UNPACK_test is the procedure that I used to test the unpack of strings for FIX$SINGLE in the main test (with the final SELECT removed). The use of dynamic SQL is a bit clunky, but I wanted to avoid using the same nvarchar(MAX) variable again, in case that would distort the results.
What the script produces is the execution time for each length, and how much the execution time and the list length increased in percent from the previous length. Here is the output from all three test machines:
Length | JAMIE4K | KESÄMETSÄ | SOMMERSKOV | ||||
---|---|---|---|---|---|---|---|
No of chars | % Chg | Time | % Chg | Time | % Chg | Time | % Chg |
60 000 | 156 | 30 | 16 | ||||
120 000 | 100 | 340 | 118 | 76 | 153 | 46 | 188 |
180 000 | 50 | 466 | 37 | 93 | 22 | 46 | 0 |
240 000 | 33 | 593 | 27 | 110 | 18 | 63 | 37 |
300 000 | 25 | 733 | 24 | 123 | 12 | 186 | 195 |
360 000 | 20 | 843 | 15 | 156 | 27 | 220 | 18 |
420 000 | 17 | 1 923 | 128 | 360 | 131 | 266 | 21 |
480 000 | 14 | 2 186 | 14 | 403 | 12 | 280 | 5 |
540 000 | 13 | 2 453 | 12 | 466 | 16 | 326 | 16 |
600 000 | 11 | 2 703 | 10 | 610 | 31 | 360 | 10 |
660 000 | 10 | 3 093 | 14 | 673 | 10 | 406 | 13 |
720 000 | 9 | 3 263 | 6 | 800 | 19 | 420 | 3 |
780 000 | 8 | 3 530 | 8 | 670 | -16 | 470 | 12 |
840 000 | 8 | 4 030 | 14 | 766 | 14 | 500 | 6 |
900 000 | 7 | 4 030 | 0 | 843 | 10 | 530 | 6 |
We can make two immediate observations:
(There are some other minor irregularities in these numbers. Please pay no attention to them. These numbers are no averages over 100 runs, but each length was only run a single time.)
There is one more observation, that was not equally immediate to me, but when SQL Server MVP Roy Harvey saw my script, he ran it a few times on his machine, and plotted this graph in Excel:
While the slope evens out after the jump, the slope is still steeper than before the jump.
I have run this script on a couple of more machines and the threshold appeared at 420 000 characters on all computers but one where it occurred at 300 000 characters. And all machines where 32-bit boxes – except that last one which was an x64 machine. Thus, it seems to depend on the processor type where this threshold appears. As for why it would be different, I don't know. And even less do I have an idea, why the threshold would be lower on x64! (Could it be something else, like memory? It doesn't seem so. One of the 32-bit machines that I ran this on was a virtual computer running VMware, with only 256 MB of memory. The threshold was still at 420 000 characters.)
Here I'd like to insert a request: I don't have access to an IA64 machine. If you have, I would be interested in seeing the results. Follow this link to get a complete version of the script that creates a database, the Numbers table, the stored procedure and the function.
As I discussed in the main article, this problem is not confined to the fixed-length method, but all methods that use T-SQL inline functions are subject to this phenomenon, whereas methods that apply chunking largely evades the problem, and the CLR and XML methods have their own handling of the long strings outside the realm of nvarchar(MAX). (This statement was also confirmed in my new suite of tests in 2009, where I tested lists with 50 000 elements.) As for the methods that uses dynamic SQL, including MANYSELECT, I have not examined this, but it does not appear very interesting anyway.
Nevertheless, as long as the input elements vary greatly in length, the fixed-length method is the first that will hit this limit, since input strings get longer. If you want to use the fixed-length method (e.g. because your input has fixed length) you could implement a chunking version of the fixed-length method, but with bigger chunks than 4 000. Or you could implement the fixed-length method in the CLR, as I did in the CLR$FIX method, see further below.
Two final notes: 1) In this section I've talked about characters. The actual limit is of course in bytes, so if you use varchar the thresholds are at 600 000 and 840 000 characters. 2) And of course the limits are somewhere in the spans 480 000 – 600 000 bytes and 720 000 – 840 000 bytes respectively. I did not find it interesting to track down the exact threshold points. Let's say that they are at 500 000 and 750 000 bytes.
The main body the text included a version for fixed-length that accepted a binary string. The idea was that for integers, this would be a faster method. Let's see how that worked out. I show the numbers for the unpack operation for all three machines, only for 2 000 and 10 000 list elements.
List Length |
JAMIE4K | KESÄMETSÄ | SOMMERSKOV | |||
---|---|---|---|---|---|---|
FIX$BINARY | $SINGLE | FIX$BINARY | $SINGLE | FIX$BINARY | $SINGLE | |
2 000 | 134 | 136 | 28 | 29 | 15 | 16 |
10 000 | 613 | 629 | 125 | 132 | 70 | 71 |
The results are not conclusive, you can't say any method is better than the other. (However, in the tests in 2009, FIX$BINARY did actually prove to be faster.)
I the main article, I discussed that using binary input could save network bandwidth, but keep in mind my tests did not cover that.
(One can note that if you use binary input, you can get quite a few more elements into the list before you hit that threshold at 500 000 – 750 000 bytes that I discussed in the previous section.)
I tested three different CLR methods: CLR$SPLIT which uses the Split function in the .Net Framework, CLR$ITER where I do the iteration myself, and CLR$FIX which works with fixed-length input. I show the result from all three machines, but only for the unpack of 10 000 list elements:
List Length |
Server | Str_UNPACK | Int_UNPACK | ||||
---|---|---|---|---|---|---|---|
CLR$SPLIT | $ITER | $FIX | CLR$SPLIT | $ITER | $FIX | ||
2 000 | JAMIE4K | 206 | 205 | 210 | 191 | 188 | 189 |
KESÄMETSÄ | 44 | 42 | 43 | 42 | 41 | 41 | |
SOMMERSKOV | 23 | 23 | 24 | 19 | 21 | 19 | |
10 000 | JAMIE4K | 993 | 973 | 1 010 | 912 | 899 | 903 |
KESÄMETSÄ | 213 | 208 | 215 | 203 | 198 | 197 | |
SOMMERSKOV | 110 | 109 | 115 | 97 | 99 | 95 |
As you see, the differences are nowhere close to dramatic. Rather they are utterly boring. CLR$ITER appears to be slightly faster than CLR$SPLIT, but the results are anything but conclusive.
The most interesting with these numbers is what does not happen with CLR$FIX: 1) It is not faster than the other methods – apparently the penalty for searching for delimiters is much smaller in the CLR than in T-SQL. 2) CLR$FIX does not fall behind the other methods for 10 000 elements on SOMMERSKOV. That is, we can see that the threshold we saw for fixed-length in T-SQL does not apply to the CLR. (The input to CLR$FIX is the same as to FIX$SINGLE & co.)
Performance-testing XML faced me with a problem. Normally, my test procedures receives an input parameter of the type nvarchar(MAX) and for all methods but XML that's uncontroversial. For XML this would be somewhat strange, because when you use XML in practice, you would of course accept a parameter of the new xml data type. Since this type is more complex, there is likely to be some overhead in receiving xml parameters. (Note that this type stores the XML document in a binary format, so full parsing is not required.) If I had started and stopped the timer in the test script before calling the stored procedure, this would not have been an issue. But since I did not want client-server communication to distort the results, I start and stop the timer in the test procedures. This left me with two options:
The only reasonable choice was to do both options to get the upper and lower boundaries for the actual cost for XML. The XMLATTR that you have already seen the results for, is the version that accepts an xml parameter. In this section, I will compare it to XMLATTR$STR, which is the same as XMLATTR except that the input is an nvarchar(MAX) parameter.
While I'm at it, we I also look at other methods using XML: XMLELEM uses nodes just like XMLATTR, but with element-centred XML. (XMLELEM takes an xml input parameter.) XMLATTR$OPEN and XMLELEM$OPEN use the old OPENXML function rather than nodes. They accept nvarchar(MAX) as input, because this is you would use OPENXML. For reference, I also include the numbers for TBLNUM$IL, "fastest of the slowest".
Here are the numbers for JAMIE4K and SOMMERSKOV for the unpack of integers; the numbers for a list of strings add nothing remarkable.
JAMIE4K | SOMMERSKOV | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Method | 20 | 200 | 650 | 2 000 | 10 000 | 20 | 200 | 650 | 2 000 | 10 000 | |
XMLATTR | 3 | 26 | 85 | 260 | 1253 | 1 | 4 | 11 | 35 | 169 | |
XMLATTR$STR | 4 | 32 | 105 | 324 | 1580 | 2 | 5 | 17 | 48 | 239 | |
TBLNUM$IL | 4 | 33 | 123 | 378 | 1850 | 1 | 3 | 14 | 47 | 230 | |
XMLELEM | 3 | 38 | 126 | 388 | 1898 | 1 | 6 | 18 | 55 | 272 | |
XMLATTR$OPEN | 8 | 50 | 229 | 802 | 4075 | 1 | 5 | 28 | 87 | 439 | |
XMLELEM$OPEN | 8 | 54 | 237 | 826 | 4198 | 1 | 6 | 28 | 91 | 458 |
We can see that the overhead for assigning the input string to an xml variable inside the procedure is by no means negligible. The overhead when we convert the XML document from a string is around 25 % on JAMIE4K and as much as 40 % on SOMMERSKOV. I did not include the numbers from KESÄMETSÄ, but the overhead is around 25 % there as well, so it may be that there is a difference between 32-bit and x64 here.
But we can also see that if we put in XMLATTR$STR in place of XMLATTR in the main results, it affects the ranking of the main results only marginally. XMLATTR ranked last of the fastest method in most cases, and with the string overhead, it still keeps a distance to TBLNUM$IL on JAMIE4K (and KESÄMETSÄ). On SOMMERSKOV the game is tighter. Assuming that the true numbers lies somewhere in between, it still seems appropriate to classify XML as a fast method.
The numbers for XMLELEM tells us that element-centred XML is 50-60 % slower than attribute-centred XML. It may be surprising that the difference is that big, and it is no less surprising, to me at least, that this is due to differences in the query plans. But the really big – and nasty – surprise was what happened when I ran the tests an SP2 instance of SQL 2005 on SOMMERSKOV. For most methods this had no significant effect on the results. Except for XMLELEM
List length | Str_UNPACK | Int_UNPACK | Str_JOIN | Int_JOIN |
---|---|---|---|---|
20 | 3 | 3 | 3 | 4 |
200 | 110 | 118 | 116 | 125 |
650 | 984 | 1 029 | 1 007 | 1 060 |
2 000 | 9 164 | 9 525 | 9 139 | 9 585 |
10 000 | 229 641 | 240 101 | 230 359 | 239 953 |
The results are nothing but a complete disaster. What it is all about is really a serious misjudgement from the optimizer. Already on SP1, the query plan for XMLELEM is more complex than for XMLATTR. (Since XQuery processing is something I'm not well versed in, I don't go into details.) But on SP2, the optimizer adds an Eager Spool operator. Often when you see an Eager Spool operator in your query plan, you know you are in trouble.
Thankfully, this does not always happen. The test procedure inserts the resulting data from the tested statement into a table, since I don't want client communication to distort the results. When I removed the INSERT, the Eager Spool did not appear, and performance was normal. I also found that adding a clustered index on the table, prevented the Eager Spool from appearing. That is, the problem appears only if you insert into a heap. It appears that this issue has been address in SQL 2005 SP3. As for SQL 2008 please see the main article.
Finally, we can see that using OPENXML does not perform very well. While it qualifies as "good enough", it's difficult to see why you should use OPENXML as it does not buy you anything that nodes does not. Unless, that is, you run into that optimizer bug in SP2, and you are already set on using XML, for instance because your aim to insert many values.
Notes added in 2010: There were several issues in this test:
<Root><Num><num>12</num></Num><Num><num>9898</num></Num>...It is likely that this lead to an over-estimation of the overhead for XMLELEM.
Some of the functions in the main article have return tables that include both a varchar and nvarchar column. In the test suite, I included both TBLNUM which returns only an nvarchar column, and TBLNUM$DUO which uses duo_chunk_split_me from the main article, and which returns both varchar and nvarchar. This table shows the results for the unpack of strings on all three servers for list length 650 and up:
Server | 650 | 2 000 | 10 000 | ||||||
---|---|---|---|---|---|---|---|---|---|
TBLNUM | $DUO | Ratio | TBLNUM | $DUO | Ratio | TBLNUM | $DUO | Ratio | |
JAMIE4K | 85 | 94 | 1.11 | 257 | 283 | 1.10 | 1 154 | 1 270 | 1.10 |
KESÄMETSÄ | 18 | 20 | 1.11 | 59 | 65 | 1.10 | 264 | 296 | 1.12 |
SOMMERSKOV | 11 | 12 | 1.09 | 32 | 38 | 1.19 | 147 | 167 | 1.14 |
TBLNUM$DUO comes with an overhead of 10-15 %. So there is a good argument for using a function that returns only one column. But keep in mind that if you work with both varchar and nvarchar, you need a function for both and you need to remember to use the right function when you join with an indexed string column, as I discussed in the section nvarchar vs. varchar in the main article.
The penalty in percent for dual return tables is likely to be lower for slower methods like the CTE and the iterative method.
The discovery that the convert of len(@param)
to int had
so dramatic effect on
inline_split_me, was one I made fairly late when working with this
article. Here is a comparison between the regular TBLNUM$IL and TBLNUM$NOCNV
which uses a version of inline_split_me without the convert, and
consequently does
not use a parallel plan. Here is data from all three machines, for the
integer operations:
Server | List length | Int_UNPACK | Int_JOIN | ||||
---|---|---|---|---|---|---|---|
TBLNUM$IL | $NOCNV | Ratio | TBLNUM$IL | $NOCNV | Ratio | ||
JAMIE4K | 20 | 4 | 4 | 1.00 | 13 | 6 | 0.46 |
200 | 33 | 34 | 1.03 | 54 | 44 | 0.81 | |
650 | 123 | 125 | 1.02 | 147 | 157 | 1.07 | |
2 000 | 378 | 385 | 1.02 | 418 | 490 | 1.17 | |
10 000 | 1 850 | 1 894 | 1.02 | 977 | 2 408 | 2.46 | |
KESÄMETSÄ | 20 | 1 | 1 | 1.00 | 3 | 1 | 0.33 |
200 | 8 | 8 | 1.00 | 11 | 9 | 0.82 | |
650 | 26 | 28 | 1.08 | 39 | 34 | 0.87 | |
2 000 | 91 | 91 | 1.00 | 128 | 114 | 0.89 | |
10 000 | 436 | 440 | 1.01 | 633 | 565 | 0.89 | |
SOMMERSKOV | 20 | 1 | 0 | 2 | 1 | 0.50 | |
200 | 3 | 4 | 1.33 | 7 | 6 | 0.86 | |
650 | 14 | 16 | 1.14 | 19 | 21 | 1.11 | |
2 000 | 47 | 48 | 1.02 | 55 | 69 | 1.25 | |
10 000 | 230 | 238 | 1.03 | 192 | 328 | 1.71 |
You can see that for the pure unpack, the effect of the type cast is miniscule. On the other hand, for the join, the effect is dramatic with enormous gains for longer lists on JAMIE4K and SOMMERSKOV. But it cuts both ways. On KESÄMETSÄ, the version without the type cast is always faster, because parallelism backfires here. And while accuracy is poor, there is an indication that parallelism only causes an overhead without returns for short list lengths.
inline_split_me lends itself to another interesting observation. The version that appears in the article was suggested to me by Brian W Perrin who had read my article for arrays and lists on SQL 2000 and found a modification that he thought would improve performance. For reference, here are both versions:
CREATE FUNCTION inline_split_me(@param nvarchar(MAX)) RETURNS TABLE AS RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000), substring(@param, Number, charindex(N',' COLLATE Slovenian_BIN2, @param + N',', Number) - Number) ))) AS Value FROM Numbers WHERE Number <= convert(int, len(@param)) AND substring(N',' + @param, Number, 1) = N',' COLLATE Slovenian_BIN2 go CREATE FUNCTION inline_split_me_old(@param nvarchar(MAX)) RETURNS TABLE AS RETURN(SELECT ltrim(rtrim(convert(nvarchar(4000), substring(N',' + @param + N',', Number + 1, charindex(N',' COLLATE Slovenian_BIN2, N',' + @param + N',', Number + 1) - Number - 1) ))) AS Value FROM Numbers WHERE Number <= convert(int, len(N',' + @param + N',')) - 1 AND substring(N',' + @param + N',', Number, 1) = N',' COLLATE Slovenian_BIN2 go
As you see, the old version adds commas before and after the input parameter everywhere it is referenced. It certainly looks less efficient, but how much does it really matter? Here are the results for unpack on JAMIE4K:
List length | Str_UNPACK | Int_UNPACK | ||||
---|---|---|---|---|---|---|
TBLNUM$IL | TBLNUM$ILOLD | Ratio | TBLNUM$IL | TBLNUM$ILOLD | Ratio | |
20 | 6 | 6 | 1.00 | 4 | 5 | 1.25 |
200 | 42 | 43 | 1.02 | 33 | 35 | 1.06 |
650 | 156 | 159 | 1.02 | 123 | 126 | 1.02 |
2 000 | 486 | 499 | 1.03 | 378 | 389 | 1.03 |
10 000 | 2 384 | 3 890 | 1.63 | 1 850 | 1 895 | 1.02 |
In all cases but one the difference is so scarce that it can be disputed whether it is significant at all. But then there is one case where Brian's version really gives an enormous performance boost. Or rather the other way round, at 10 000 string elements, the old version apparently hits a threshold and execution time goes through the roof. What sort of threshold this might be I have not investigated, and neither I have investigated whether Brian's version also hits a threshold further up. But I think that the important observation is that there are more threshold issues with the MAX data types than just the one we saw for fixed length.
You may wonder what happens when we join, and get the parallel plan. For the fun of it, here are the numbers for KESÄMETSÄ (where the parallel plan is not really good):
List length | Str_JOIN | Int_JOIN | ||||
---|---|---|---|---|---|---|
TBLNUM$IL | TBLNUM$ILOLD | Ratio | TBLNUM$IL | TBLNUM$ILOLD | Ratio | |
20 | 4 | 4 | 1.00 | 3 | 2 | 0.67 |
200 | 17 | 18 | 1.06 | 11 | 11 | 1.00 |
650 | 62 | 64 | 1.03 | 39 | 40 | 1.03 |
2 000 | 205 | 208 | 1.01 | 128 | 131 | 1.02 |
10 000 | 977 | 1 022 | 1.05 | 633 | 658 | 1.04 |
The threshold goes away more or less entirely. The ratios for 10 000 string elements are somewhat higher on SOMMERSKOV and JAMIE4K, but it's far cry from what happens without parallelism. No, I have no idea why.
I like to give you a quick look at the methods for inserting many values. The main results included MANYSELECT which transforms the input string into an INSERT from an EXEC() of many SELECT statements. Let's also look at MANYUNION, where string is transformed to a big INSERT SELECT-UNION statement, and MANYINSERT where the transformation gives many INSERT VALUES statements. This is the results for the unpack operation on SOMMERSKOV:
Method | Str_UNPACK | Int_UNPACK | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 200 | 650 | 2 000 | 10 000 | 20 | 200 | 650 | 2 000 | 10 000 | ||
MANYSELECT | 3 | 15 | 44 | 136 | 904 | 2 | 9 | 24 | 76 | 391 | |
MANYUNION | 5 | 99 | 885 | 11 533 | 371 813 | 3 | 39 | 312 | 4 797 | 176 834 | |
MANYINSERT | 15 | 173 | 531 | 1 566 | 7 631 | 14 | 152 | 463 | 1 319 | 6 601 |
As you can see, both MANYINSERT and MANYUNION trail far far behind the more complex MANYSELECT. But it's nevertheless interesting to compare MANYINSERT and MANYUNION. For the shorter list lengths, UNION performs better than INSERT, but UNION has a dreadful development. When the list length ten-folds from 200 to 2 000, the execution time for UNION hundred-folds!
The answer surely lies in the optimizer. For the optimizer the long SELECT UNION statement is very complex, even if it's all constants. For shorter list lengths, UNION manages to beat INSERT, because INSERT has an overhead for writing one row at a time. But as the input size increases, the overhead for the optimizer makes you lose everything you gained on the roundabout.
There is one more thing to comment, one we already saw in the main results. For all these methods, the execution time for unpacking a list of strings is considerably higher than for integers. I would assume that this is due to query processing and parsing: when unpacking strings there are more characters and also more tokens (the single quotes) to process.
The test suite includes two set of test procedures for dynamic SQL, EXEC$A and EXEC$B, but apart from the name, they are identical. EXEC$B is simply EXEC$A run a second time with exactly the same input. This means that we can expect EXEC$B to find a plan for the dynamic SQL query in cache. This permits us to use EXEC$B to determine how much of EXEC$A that is compilation. This is the data from SOMMERSKOV. For reference, I also include the numbers for FIX$SINGLE
Method | Str_JOIN | Int_JOIN | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 200 | 650 | 2 000 | 10 000 | 20 | 200 | 650 | 2 000 | 10 000 | ||
EXEC$A | 5 | 42 | 139 | 499 | 4 267 | 4 | 23 | 86 | 333 | 3 518 | |
EXEC$B | 1 | 3 | 12 | 42 | 156 | 0 | 1 | 5 | 19 | 415 | |
FIX$SINGLE | 1 | 4 | 15 | 49 | 299 | 0 | 3 | 8 | 32 | 151 |
We find that EXEC$B is faster than anything we have looked at in this article. And of course, it should be. There is no overhead of a table of numbers, no need to pass the nvarchar(MAX) to the CLR and search the string. Just look up the values. But how often do you run the same list of values twice? EXEC$B cannot count as a real method. And it is very obvious that the cost with dynamic SQL lies with the compilation.
There is one more observation to make: for the "join" over 10 000 integer values, EXEC$B does not perform equally well, about in par with CTE$IL. Looking at the values for the individual runs, the very most of them are the in the range 340-375 ms, and then there are two runs exceeding 3 500 ms. That is, occasionally, the plan from EXEC$A is for some reason not around in the cache when EXEC$B runs.
But that only explains why the average is 415 ms and not 355 ms, which is still far behind FIX$SINGLE. The answer to this lies in the query plan. The optimizer builds an internal table of constants, akin to a temp table, and remember what happened with a temp table when joining over the integer column with 10 000 elements: there was a table scan + hash match, and this is what happens in this case as well. In the main article, I said that dynamic SQL could be a last resort if you had problems to get good performance with short lists, because dynamic SQL is the only way for the optimizer to have full knowledge. But you can see here why I made the qualification that the lists should be short. For 10 000 elements, the optimizer thinks for three seconds, and then comes up with a non-optimal plan.
What happens when the setting force parameterisation is in effect? I ran a separate test on SOMMERSKOV to find out, and to make it quick, I only had ten runs per method. It's perfectly enough to show the effect though:
Method | Str_JOIN | Int_JOIN | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 200 | 650 | 2 000 | 10 000 | 20 | 200 | 650 | 2 000 | 10 000 | ||
EXEC$A | 1 | 12 | 34 | 72 | 4142 | 1 | 4 | 16 | 56 | 3 399 | |
EXEC$B | 1 | 6 | 23 | 41 | 155 | 1 | 1 | 10 | 37 | 347 | |
FIX$SINGLE | 4 | 4 | 17 | 52 | 296 | 2 | 2 | 12 | 32 | 151 |
That was indeed the "go faster" switch. But you can see that EXEC$A does not beat FIX$SINGLE even with auto-parameterisation. For some reason the rerun in EXEC$B is still considerably faster, why I don't know. (Since other methods runs before EXEC$A with the same input, the data is already in cache, so that is not the reason.)
More importantly, the speed disappears completely for the longest lists. Since a batch or stored procedure in SQL Server cannot take more than 2 100 parameters, parameterisation does not happen when the input list has more elements.
The full results from the quick test with forced parameterisation included a few more methods, and you can view the results here.
Here is a list of all methods in the test, with a brief description. Click each method name to see the test procedures, and from there you can get to the tested functions.
Again, here are the links for the full results for all machines: JAMIE4K, KESÄMETSÄ and SOMMERSKOV.
CLR$FIX | Fixed-length format, implemented with the CLR. |
CLR$ITER | Rolling our own in the CLR. |
CLR$SPLIT | CLR using the split method. |
CTE$CHUNK | A chunked procedure using a recursive CTE. |
CTE$IL | cte_split_inline in the main article. |
CTE$MSTMT | A multi-statement function with the same contents as cte_split_inline. |
EXEC$A | Dynamic SQL. |
EXEC$B | Dynamic SQL run a second time. |
FIX$BINARY | fixbinary_single |
FIX$ITER | The iterative method with fixed-length input. |
FIX$MSTMT | A multi-statement version of fixstring_single. |
FIX$MULTI | fixstring_multi. |
FIX$MULTI2 | fixstring_multi2. |
FIX$SINGLE | fixstring_single. |
FIX$TMPTBL | fixstring_single with data bounced over a temp-table. |
ITER$CHUNK | The iterative method with chunks. |
ITER$SIMPLE | The iterative method without chunking. |
MANYINSERT | Making the list into many INSERT statements. |
MANYSELECT | Making the list into many SELECT statements. |
MANYUNION | Making the list into many SELECT UNION. |
REALSLOW | Using charindex. |
SLOW$LIKE | Using LIKE. |
TBLNUM | chunk_split_me, same as duo_chunk_split_me, but with a single return column. |
TBLNUM$DUO | duo_chunk_split_me. |
TBLNUM$IL | inline_split_me. |
TBLNUM$ILOLD | An older version of inline_split_me. |
TBLNUM$NOBIN | A version of inline_split_me without COLLATE clauses. |
TBLNUM$NOCNV |
inline_split_me where len(@param) is
not cast to int. |
XMLATTR | Attribute-centred XML with nodes and input as xml. |
XMLATTR$OPEN | Attribute-centred XML with OPENXML. |
XMLATTR$STR | Attribute-centred XML with nodes, and input as nvarchar(MAX). |
XMLATTR$TMPTBL | Same as XMLATTR, but results bounced over a temp table. |
XMLATTR$TVAR | Same as XMLATTR, but results bounced over a table variable. |
XMLELEM | Element-centred XML with nodes and input as xml. |
XMLELEM$OPEN | Element-centred XML with OPENXML. |
2010-10-10 – Updated the section on XML, since the tests on XML had some flaws.
2010-01-06 – There is now a second appendix with data from a new suite of tests on SQL 2008, to test new methods.
2008-12-20 – Added note that the issue with element-centred XML does not seem to appear in SQL 2005 SP3 and SQL 2008.
2007-03-03 – Initial version.