An SQL text by Erland Sommarskog, SQL Server
MVP. Most recent update 2016-12-18.
Copyright applies to this text. See here for font conventions used in this article.
If you want to import a file into SQL Server there are a number of options available:
The topic for this article is the last three. They are far from always the best choice, but there are situations where they can fulfil your needs very well. One distinct allure they have is that they can save you from dragging in another component like SSIS into your mix only because you need to import a handful of files.
The bulk-load tools have been in the product for a long time and they are showing their age. When they work for you, they are powerful. But you need to understand that these tools are binary to their heart, and they have no built-in rule that says that each line a file is a record – they don't even think in lines. You also need to understand that there are file formats they are not able to handle.
I have tried to arrange the material in this article so that if you have a simple problem, you only need to read the first two chapters after the introduction. I first introduce you them to their mindset, which is likely to be different from yours. Next I cover the basic options to use for every-day work. If you have a more complex file, you will need to use a format file and the next three chapters are for you. I first describe how format files work as such, and the next two chapters show how to use format files for common cases for import and export respectively. There is a brief interlude about a new feature in SQL Server vNext that turns the tables for CSV files before a chapter about Unicode files, including files encoded in UTF‑8. Then comes a chapter about "advanced" options, including how to load explicit values into an IDENTITY column. A short chapter covers permissions. The last chapter discusses XML format files, and I am not sorry at all if you give this chapter a blind eye – I find XML format files to be of dubious value.
Table of Contents
The command-line tool BCP has been in the product since version 1.0. The original purpose was surely to permit copying of data between SQL Server instances using native format (which is binary), and this is what BCP was designed for. Handling text files might have been slapped on as an after-thought. The command BULK INSERT was added in SQL Server 7 so that you could bulk-load files from inside SQL Server. SQL 2005 added the rowset function OPENROWSET (BULK) to permit you to query the data file as if it is a table.
While the three offer the same basic functionality they are different implementations. BCP uses ODBC, while the other two rely on OLE DB and the three don't behave identically. Many of the differences are with nitty-gritty details that you may never encounter or notice. But as a general tip: if you run into problem with one of them that you can't understand or you don't seem to able to work around, try one of the other in trio – maybe it works with that guy
To muddle the waters further, OPENROWSET(BULK) works differently depending on how you use it. It seems that if you say:
INSERT tbl (...) SELECT * FROM OPENROWSET (BULK...) AS tbl
The behaviour is more or less identical to BULK INSERT. But if you replace the star with an explicit column list or add a WHERE clause, there are changes in the behaviour. I will mention some of these differences in this article, but I'm not making any attempt to include a complete catalogue.
Note: there are also APIs available for bulk load. For a native application you would use the ODBC API, which is very close to what the BCP command offers. There is also a bulk-load API in OLE DB, but given that OLE DB is deprecated for access to SQL Server, this is nothing you should consider. For a .NET application there is the SqlBulkCopy class which is a yet another implementation of the bulk-load concept. However, I am not covering any of the APIs in this article.
This article is full of small example files. The contents of the files is displayed within the article, and they are also available in the archive bulkload.zip. If you want to play with the examples, you need to download and extract this archive. You cannot copy and paste from the text in the article, as you need to get the exact bytes. You can extract the files wherever you want, but C:\temp\bulkload is a good choice, since all example scripts in this article use this folder.
Except where otherwise noted, the examples in this article run on all versions from SQL 2005 and up, although I have mainly tested on SQL 2008 and SQL 2016.
All CREATE TABLE scripts are intended to be executed in tempdb. Some tables are reused from previous examples, and you may have to truncate them or drop and recreate them to get the result I present in the text. This is nothing I point out in the article, but I assume that you are able to figure out that part yourself.
A note about character encodings. In this text I will talk about 8-bit files as in contrast to Unicode files, although this is a bit of a misnomer. In a non-Unicode file for Swedish, English, Polish or Russian each character is encoded indeed in 8 bits. However, non-Unicode files for East Asian languages such as Chinese or Japanese use a "multi-byte character set" (as they are known in Windows), where only the letters from the Latin alphabet are encoded in 8 bits. Suspecting that many readers of this article will come from countries where the Latin and Cyrillic scripts dominate, I have opted overlook this fact. I hope that East Asian readers can accept my apology for this simplification.
Before we start to look at commands, options, format files etc, I want to directly go to the point which causes the most confusion and frustration when the naïve user tries to use the bulk-load tools: the conflicting mindsets. As humans when we look at a text file with data, we think of every line as a record to import, and since the bulk-load tool appears to have a concept of a record terminator, we expect the bulk-load tool to have the same perception.
But this is an illusion. As I mentioned, the initial purpose of BCP was to permit copying data between different SQL Server instances using a binary format. And while BCP, BULK INSERT and OPENROWSET(BULK) can work with text files, they still look at these text files as binary streams. When they read a file, they look at the definition for the first field, whereupon they read bytes until they have come to the end of that field. The second field is handled the same, and they jog along that way in a round-robin fashion. The only way they can register that they have reached the end of the record is by reading the last field. But if something that you think of as a record terminator appears in another field, the tools just take it to be part of the data.
Here is an example to illustrate how things can go wrong. Consider this table (created in tempdb):
CREATE TABLE x (A varchar(39) NULL, B varchar(39) NULL, C varchar(39) NULL, D varchar(39) NULL)
We have the data file data1.txt, which has one field less than there are columns in the table:
Alpha,beta,gamma A,B,C I,II,III 1,2,3
We load this file with any of the commands
BCP tempdb..x in data1.txt -T -c -t, -r\n
BULK INSERT x FROM 'C:\temp\bulkload\data1.txt' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR='\n')
When looking at the data in grid mode in SSMS, we see this:
A B C D
----- ---- --------- ------
Alpha beta gamma A B,C
I II III 1 2,3
The file appears to have four records, but only two rows were added to the table. In text mode, we see four lines, but the data does not align with the headers:
A B C D
---------------------- ------------ ------------- ------------
Alpha beta gamma
I II III
The source of confusion there is that there are line breaks in the two values in column C, but in grid mode the two end-of-line characters, carriage-return and line-feed (CR-LF), appear as spaces.
This happened exactly because of what I said above. There are four columns in the table, why BCP and BULK INSERT assumed the file to have four fields: the first three terminated by comma and the last by CR-LF (but due to the quirks of history, specified only as
\n). Thus, they first read Alpha, found a comma and saved Alpha for column A. This was repeated with beta and column B. Next they read gamma. Still no comma, so they moved on past the line break and the A to find a comma and they saved gamma<CR><LF>A as the data for column C. When reading data for column D, commas were of no interest, now they only cared about the line break why they read B,C in one swoop. They proceeded in the same way with next two lines in the file – which for them was just the next couple of bytes. That is, they first found I, and then II. Still looking for a comma, they proceeded with III, consumed the line break and read the 1 until they found a comma to stop on. Again they switched to looking for a line-break and 2,3 was read as the data for the last row in the last column.
This may be confusing and bewildering, but if you plan to use these tools, you need to knock their way of thinking into your mind, because else you will never make friends with them.
BCP, BULK INSERT and OPENROWSET (BULK) all provide a multitude of options. Here I will give a short overview of the basic options that you need to load a file and specify the format without using a format file. As we move through the text, I will mention other options where they apply, and there is a later chapter that has an overview of most other (but not all) options. For the full syntax of these commands, please see the relevant topics in Books Online for BCP, BULK INSERT and OPENROWSET(BULK).
d. If the table name includes space or other characters special to the shell, you need to quote it. For instance:
BCP "Northwind..[Order Details]" in datafile.txt -T -n
in, but you can also specify
outto export a table,
queryoutto use a query as a source rather than a table for the export, and
formatto create a format file. BULK INSERT and OPENROWSET (BULK) can only be used for import and they do not have this option.
-Soption to specify the server, default is the default instance on the local machine. For authentication, use either
Tfor Windows authentication or
user ‑P pwdfor SQL authentication. There is no default, but you need to provide one of
-P. In the examples in this article, I use
Tand leave out
Sfor brevity, but you may need to change that for your environment. Since BULK INSERT and OPENROWSET(BULK) run from within SQL Server, authentication is not an issue with these two.
‑f. With BULK INSERT and OPENROWSET (BULK) you use the option FORMATFILE. This option is mandatory with OPENROWSET (BULK), except in the cases noted below.
char(8-bit text files),
widechar(Unicode text files),
widenative(both binary). The default is
char. With BCP this corresponds to the options
-Nrespectively. If you specify neither of them and no format file, BCP prompts you for each column in the table. (I will not cover this prompting session in this article.)
‑toption. The default field terminator with both is
\t, tab. The terminator can be up to 10 characters long. With BCP, keep in mind that if the terminator includes characters that are special to the shell, for instance the pipe character (
|), you need to put it in quotes. (See the example below.)
‑r, with BULK INSERT it is ROWTERMINATOR. The default for both is a regular Windows line-ending, that is, carriage return-line feed (CR-LF). The normal way to denote this would be
\r\n, however there is a quirk here: for legacy reasons
\nalone is interpreted as CR-LF. This presents a challenge when you want to import Unix files, since such files typically only have LF (
\n) as the line terminator. You can work around this by specifying the terminator as
0x0Aas in these examples:
BCP tempdb..x in unixfile.txt -c -t"|" -r0x0A -T BULK INSERT x FROM 'C:\temp\bulkload\unixfile.txt' WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '0x0A')A confession: it took this MVP about 20 years with the product to learn this workaround!
A note on terminology – rows vs. records. In this article, I use record to denote a complete set of fields in the data file. But the T‑SQL commands and Books Online use row instead which I find unfortunate for two reasons: 1) It causes confusion with tables, which is where you expect to find rows. 2) A row could be understood as a line in a text file, but as we have seen, these tools are binary to their heart and do not think in lines like humans do.
As I mentioned, we will look at more options as we move on. However, there is one more option I like to mention here and now, and that is the option to mandate checking of CHECK and FOREIGN KEY constraints. With BCP this option is
-h CHECK_CONSTRAINTS and with BULK INSERT it is CHECK_CONSTRAINTS. You should always include these options in your BCP commands and BULK INSERT statements for reasons I will explain in the chapter Other Bulk-Load Options. If you use OPENROWSET(BULK), this check is performed by default, why don't need this option.
Nevertheless, in the interest of brevity, I am not including these options in the examples in this article.
Many data files can be described with field and record terminators only, but not all files follow that pattern. A common case are files where one or more fields are quoted to protect field terminators that occur in the data. Here is an example, the file formatdemo.txt:
11,"Here is some text in quotes","2012-12-12" 12,"And this, is text with a comma","2013-09-02"
If you attempt to import this file by specifying comma as the sole field terminator, this fails. The bulk-load tools have no built-in understanding of CSV files and they will misread the second record, thinking that the data of the second field is "And this. To import this file correctly, you need to use a format file. There are two types of format files: old-style format files and the newer XML format files. I much prefer the former, and the article focuses on them. Only in the last chapter, I will discuss XML format files.
HOT OFF THE PRESS! A lot of the discussion (but not all) below is about comma-separated files, and with the release of CTP 1.1 of SQL Server v.Next the day before I type this, the playfield for CSV files changed entirely, since BULK INSERT and OPENROWSET(BULK) now has built-in support for CSV. See the chapter Built-in Support for CSV Files in SQL Server vNext.
Assume that we want to import the file above into this table:
CREATE TABLE formatdemo (idcol int NOT NULL, textcol varchar(50) NOT NULL, datecol datetime NULL)
Here is a format file, formatdemo.fmt, for the file above:
9.0 3 1 SQLCHAR 0 0 ",\"" 1 idcol "" 2 SQLCHAR 0 0 "\",\"" 2 textcol Latin1_General_CI_AS 3 SQLCHAR 0 0 "\"\r\n" 3 datecol ""
To load the file into the tables you would use any of these commands:
BCP tempdb..formatdemo in formatdemo.txt -f formatdemo.fmt -T BULK INSERT formatdemo FROM 'C:\temp\bulkload\formatdemo.txt' WITH (FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt')
In the format file there are two header lines followed by a number of lines that each describes a field in the file. Each such line contains eight different properties for the field, separated by white space. Here is a quick summary of the properties:
In the following, we will look closer at these properties. For completeness, I should say that there is a ninth property, the data type to use in SQL Server, which is only available when you use XML format files. We will look at this property in the last chapter of the article.
The first header line specifies the version of the format-file format. Above I have used 9.0, which is the same as SQL 2005. There is not really any difference between the formats from 8.0 and up, but if you use a version number that is higher than your version of BCP or BULK INSERT, you get an error message. (Which for BCP confusingly talks about XML.) Lower versions of the format are always accepted, why using 9.0 seems like a good idea.
The second header line specifies how many fields there are per record in the file.
The first field property is the field number. You should always list the fields in order. From what I can tell, the tools enforce that the numbers are in order and consecutive, and that you have listed at least as many fields as you said on the second line. Beware though, that extraneous field descriptions are ignored. This can lead to confusion if you add a field description, but forget to change the second header line.
The second property is the data type in the source file, not in the table. For a text file you should always specify SQLCHAR for all fields, unless you have a Unicode file in in UTF‑16 encoding in which case you should use SQLNCHAR. You should never use types like SQLINT, SQLDATETIME etc. as they apply to binary data only.
When the bulk-load tools read a field, they need to know when to stop reading bytes for that field. This information is conveyed in the properties three to five: prefix length, fixed length and field terminator in that order. The latter is what you most commonly will use with a text file, so we will start there.
The field terminator is a string of one to ten characters enclosed in double quotes. If the terminator includes a double quote, you need to escape it with a backslash, as we saw in formatdemo.fmt above. There are a few more escape sequences you need to know:
\\ is for the backslash itself,
\t = tab,
\r = carriage return (CR) and
\n = line-feed (LF). Note that in format files there is no special interpretation of
\n as two characters; this only happens with the options FIELDTERMINATOR and
-t. Thus, to specify a regular Windows line-break as the field terminator, you use
Occasionally, you have files where the fields have a fixed width. Assume this table:
CREATE TABLE fixedlength (x varchar(14) NOT NULL, y int NOT NULL, z varchar(20) NOT NULL, w char(1) NOT NULL)
Here is a sample file, fixedlength.txt:
First field 12345678Third field Y Second record 4711End of 2nd recordN Third record 15The value is 15 Y
To load this file, you would use a format file where you specify the field widths in the fourth property in the format file. Here is one possibility, fixedlength.fmt:
9.0 4 1 SQLCHAR 0 14 "" 1 a Latin1_General_CI_AS 2 SQLCHAR 0 8 "" 2 b "" 3 SQLCHAR 0 17 "" 3 c Latin1_General_CI_AS 4 SQLCHAR 0 0 "\r\n" 4 d Latin1_General_CI_AS
Note here that only the first three fields have fixed length; the last is terminated by CR‑LF. The line ending can be handled in multiple ways, and I will present alternatives as we move on.
If you look closer, you see that the format file specifies a length of 17 for the third column, while the column is defined in the table as varchar(20). This is alright. The format file specifies the layout of the file, which does not have to match the table exactly. However, since the trailing spaces are saved into the table, the length of the field in the file cannot exceed the length of the table column, or the import will die with a truncation error.
If you specify both length and terminator, BULK INSERT and OPENROWSET (BULK) read data until they find a terminator, but if the length in the format file is exceeded, they produce an error message, why the length serves as an assertion. BCP, on the other hand, appears to ignore the length entirely when you have specified a terminator.
The third property in the format file, the prefix length, applies only to binary files. The prefix is a 1-, 2- or 4-byte number that gives the length of the value it precedes and the property specifies the length of the prefix and the only permitted values are 0, 1, 2 and 4.
To illustrate, here is an example of little practical use. We have this data file prefixlen.txt which do not have a CR-LF at the end:
!This text has 33 characters in it#And here are 35 with no line breaks
We have this one-column table:
CREATE TABLE prefixlen(a varchar(50) NOT NULL)
To load prefixlen.txt, we use the format file prefixlen.fmt with a prefix length of 1:
9.0 1 1 SQLCHAR 1 0 "" 1 col Latin1_General_CI_AS
This results in two rows with the values This text has 33 characters in it and And here are 35 with no line breaks. To understand what is happening, you should know that the ASCII codes for exclamation mark and the number sign are 33 and 35 respectively. That is, while the prefixes in this particular examples are displayed as characters they still work as numbers.
But as I said, prefix length is for binary files, so as long as you work with text files, the prefix-length property should always be 0.
Property six is the number of the column in the target table which the field should be imported into. Column numbers start on 1, and 0 means that the field is not to be imported. The seventh property is the column name, but this property is only informational with BCP and BULK INSERT. For OPENROWSET (BULK), property seven provides the column name when you select from the file.
To see this in practice, let us return to the fixed-length example in the previous section. Here is fixedlength2.fmt, which is an alternate way to write the format file:
9.0 5 1 SQLCHAR 0 14 "" 1 a Latin1_General_CI_AS 2 SQLCHAR 0 8 "" 2 b "" 3 SQLCHAR 0 17 "" 3 c Latin1_General_CI_AS 4 SQLCHAR 0 1 "" 4 d Latin1_General_CI_AS 5 SQLCHAR 0 0 "\r\n" 0 "" ""
The difference is that the fourth field is now of fixed length like the others, and instead I describe the line ending with a fifth field that I don't import (and I expect it to always be empty). I find this approach for fixed-length files a little cleaner.
You may note that the column names in the table do not match the names in the format file. Nevertheless, this command imports the file just fine:
BULK INSERT fixedlength from 'C:\temp\bulkload\fixedlength.txt' WITH (FORMATFILE = 'C:\temp\bulkload\fixedlength2.fmt')
Best practice is of course use the actual column names, but just be aware of that the names do not mean anything, unless you are using OPENROWSET (BULK):
SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\fixedlength.txt', FORMATFILE = 'C:\temp\bulkload\fixedlength2.fmt') AS tbl
The output is:
a b c d
-------------- -------- ----------------- ----
First field 12345678 Third field Y
Second record 4711 End of 2nd record N
Third record 15 The value is 15 Y
There is nothing that says that the order of the columns of table much match the order of the fields in the file. Nor do all columns in the table have to have a matching field in the file. (But columns you leave out must obviously be nullable or have a default value.) As an example, we rearrange our test table a little:
CREATE TABLE fixedlength3 (a date NOT NULL DEFAULT sysdatetime(), w char(1) NOT NULL, x varchar(14) NOT NULL, z varchar(20) NOT NULL, y int NOT NULL)
We can load our data file with the format file fixedlength3.fmt which illustrates that column order and field order may be different:
9.0 5 1 SQLCHAR 0 14 "" 3 x Latin1_General_CI_AS 2 SQLCHAR 0 8 "" 5 y "" 3 SQLCHAR 0 17 "" 4 z Latin1_General_CI_AS 4 SQLCHAR 0 1 "" 2 w Latin1_General_CI_AS 5 SQLCHAR 0 0 "\r\n" 0 "" ""
When you work with column numbers, you need to be careful, as the bulk-load tools are not very good at detecting errors. If you specify the same column number twice, BULK INSERT and OPENROWSET (BULK) appear to detect this condition and produce an error message, but BCP stays silent and you get what you get. On the other hand, if you specify a column number beyond what's in the table, BULK INSERT and OPENROWSET (BULK) accept this silently, whereas BCP gives an error message, albeit it is very cryptic: Invalid Descriptor Index.
The last property in the format file is the collation which serves as a somewhat roundabout way to specify the code page for 8-bit files. This determines how the bulk-load tools interpret characters with code points outside the ASCII range, (that is, from code point 128 and beyond).
To illustrate, let's take this format file, collation.fmt, which is a bit contrived – you would never have a format file like this in real life.
9.0 4 1 SQLCHAR 0 0 "|" 1 a "" 2 SQLCHAR 0 0 "|" 2 b Latin1_General_CI_AS 3 SQLCHAR 0 0 "|" 3 c Polish_CS_AS 4 SQLCHAR 0 0 "\r\n" 4 d Greek_CS_AS
We have this table:
CREATE TABLE collationtest (a char(9) COLLATE Finnish_Swedish_CS_AS NOT NULL, b char(9) COLLATE Finnish_Swedish_CS_AS NOT NULL, c char(9) COLLATE Finnish_Swedish_CS_AS NOT NULL, d char(9) COLLATE Finnish_Swedish_CS_AS NOT NULL)
There is this data file, collation.txt, which holds place names in Norway, Denmark, Sweden and Iceland respectively.
Tromsø|Tromsø|Tromsø|Tromsø Næstved|Næstved|Næstved|Næstved Årjäng|Årjäng|Årjäng|Årjäng Þingvalla|Þingvalla|Þingvalla|Þingvalla
When I loaded this file using collation.fmt, I got this result (for reasons that follow, you may or may not get exactly the same result.)
a b c d
--------- --------- --------- ---------
Troms° Tromsø Tromsr Troms?
Nµstved Næstved Ncstved N?stved
+rjõng Årjäng Lrjäng ?rjdng
Ìingvalla Þingvalla Tingvalla ?ingvalla
As you see, in columns a, c and d, the data has been garbled, while the data is correct in column b. When you have a non-Unicode file, the file should adhere to a code page which defines the character repettoire and the code points for the characters. For regular Windows application, Windows uses an ANSI code page that fits the regional settings. In Western Europe and Americas, the code page is typically 1252. Other examples are 1251 for Eastern Europe and 1253 for Greek. If you want to see which ANSI code page your machine has, you can look at HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP in the registry.
A collation in SQL Server is tied to a code page, and you can use the function collationproperty to review the code page for a certain collation. As long as you are only loading data, the code page is all that matters with the collation. (If you use OPENROWSET(BULK) to query a file and sort the data or similar, other properties of the collation also come into play.) The code page for Finnish_Swedish_CS_AS and Latin1_General_CI_AS is 1252 for both, while the code page for Polish_CS_AS is 1251 and for Greek_CS_AS it is 1253.
When BCP reads the Å in Årjäng, is actually reads a byte with the value 197 and when the format file says that the code page is 1251, this is interpreted as Ĺ (capital L with an acute accent). This character is not present in code page 1252, so as a fallback it is replaced by an unaccented L. When the code page is said to be Greek, Å is replaced by a question mark (?) which is the generic fallback character, when nothing else fits.
What happens in the first case, when we do not define any collation at all? Windows also has a so-called OEM code page that is used in the command-line window which goes back to the days of MS-DOS and OS/2. And because BCP is a command-line tool of that era, it assumes that when no collation is specified that the data is in the OEM code page. To be compatible with BCP, BULK INSERT and OPENROWSET (BULK) have the same default. My OEM code page is 850 which has (roughly) the same character repertoire as 1252, but the non-ASCII characters do not have the same code points, why the data is misinterpreted and ends up as garbled in the table. (To see what OEM code page you have, look the value OEMCP in the same place in the registry as above.)
The lesson here is that you should never leave the collation field empty for character data, because it is not very likely that you will actually have data in the OEM code page. Alternatively, you can specify the character set with an option to the bulk-load command to override the default. With BCP this option is
-C and with BULK INSERT and OPENROWSET(BULK) the option is CODEPAGE. We run this:
bcp tempdb..collationtest in collation.txt -f collation.fmt -T -C 1252
Now the first column, a, is loaded correctly, but columns c and d are not, since the collation in the format file takes precedence over the command line.
Rather than specifying the code page by number, you can say ACP to state that the file is in your ANSI code page. While this saves you from knowing the code-page number, it could lead to surprises if computers in your organisation have different ANSI code pages and you move the load operation around.
Note: Here is an example of how OPENROWSET(BULK) works differently depending on how it is used. If I run
INSERT collationtest(a, b, c, d) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\collation.txt', FORMATFILE = 'C:\temp\bulkload\collation.fmt') AS tbl
I get the same result is as above. But if I run only the SELECT or add a WHERE 1 = 1, columns c and d are not garbled, while column a still is. I have not been able to understand why this happens.
The bulk-load tools are very rigid with the format of the format file. If you insert blank lines for legibility, BULK INSERT and OPENROWSET (BULK) appear to cope with it, but not BCP will not.
If you don't specify the collation, you must still mark the property with
""; you cannot leave it out entirely.
There must be a final CR-LF after the line which defines the last field in the file or your reward will be a cryptic error. For instance, BCP says I/O error while reading BCP format file.
Overall, errors from the bulk-load tools are often obscure and very difficult to understand.
We have now learnt the meaning of the properties of the format files. In this chapter we will look at how we can use format files to import data into SQL Server, and in the next we will use them for data export.
If you have a wide table with many columns, you may sigh at the thought of authoring a format file. But if all you need is a small tweak, there is a shortcut you can take: you can use BCP to generate a format file that you can start from. For instance:
bcp tempdb.sys.objects format nul -f sysobjects.fmt -c -t, -T
This command generates a format file for sometable where all fields but the last has a comma as the field separator. The last has the regular \
r\n. The key here is the
format option in the second argument and
-f to specify the name of the format file. If you rather would like an XML format file, use
-x in place of
You may wonder about
nul, the third argument to BCP here. In Windows nul is the name of the null device; data written here ends up in the bit bucket. You can specify whatever you like, since BCP will not read nor write any data file, but the argument must be there.
I like to add that if all you want to do is to copy data from one SQL Server instance to another, it is a very good idea to start with generating a format file, using the
-n option for native format. (When using the bulk-load tools to copy data from one SQL Server to another, there is little reason to use text files.) This avoids surprises when loading the data on the target instance, not the least if collations are different. Use the format file both when exporting and when importing the data.
We have already learnt how to import a file with quoted fields, but what if the first field is quoted? Look at the file initialquote.txt:
"In this file, the first field is quoted",12,"Middle field","Last field" "One more quoted field",99,"Next middle field","Final last field"
We want to load the file into the table below:
CREATE TABLE initialquote (firstcol varchar(50) NOT NULL, intcol int NOT NULL, midcol varchar(50) NOT NULL, lastcol varchar(50) NOT NULL)
The challenge is how to load the data without loading any of the quotes. We know that we can write a format file where we include the quotes in the field terminator. But how do we deal with the initial quote? I first learnt this trick from SQL Server MVP Linda Wierzbicki: we add an initial dummy field with double quote as the field terminator to the format file. Here is initialquote.fmt:
9.0 5 1 SQLCHAR 0 0 "\"" 0 "" "" 2 SQLCHAR 0 0 "\"," 1 firstcol Latin1_General_CI_AS 3 SQLCHAR 0 0 ",\"" 2 intcol Latin1_General_CI_AS 4 SQLCHAR 0 0 "\",\"" 3 midcol Latin1_General_CI_AS 5 SQLCHAR 0 0 "\"\r\n" 4 lastcol Latin1_General_CI_AS
It may seem bewildering that there are five fields in the format file, when there are four columns in the table and four fields in the data file. But the data of this field is always the empty string, and, as testified by the zero in the sixth property, it is not imported into the table.
Sometimes you have a file with a header you don't want to include. Here is a simple case, the file skipheader1.txt:
co1,col2,col3,col4 11,21,31,41 98,23,34,87
We want to load the file into this table:
CREATE TABLE skipheader1 (col1 int NOT NULL, col2 int NOT NULL, col3 int NOT NULL, col4 int NOT NULL)
You don't need a format file here: you can use the option
-F with BCP or FIRSTROW with BULK INSERT and OPENROWSET (BULK) to skip the header line. For instance:
BCP tempdb..skipheader1 in skipheader1.txt -c -t, -T -F 2
BULK INSERT skipheader1 FROM 'C:\temp\bulkload\skipheader1.txt' WITH (FIELDTERMINATOR = ',', FIRSTROW = 2)
But life is not always that simple. A challenge with header lines is that they do not always follow the pattern of the data records. The typical example is that while one or more data fields are quoted, the headers are not. In this case FIRSTROW = 2 will not do, because the tools read field by field, and they have no understanding of what is a header. FIRSTROW = 2 simply means "don't import the first record". And that is what the bulk-load tool think is a record. That record must still conform to the specified format.
There are two cases here, one simple and one difficult. The simple case is when you are so lucky that the first field is quoted. Let's look at initialquote2.txt:
firstfield,num,midfield,lastfield "In this file, the first field is quoted",12,"Middle field","Last field" "One more quoted field",99,"Next middle field","Final last field"
We handle this file the same way as you handled initialquote.txt above. That is, we use exactly the same format file and command:
bcp tempdb..initialquote in initialquote2.txt -T -f initialquote.fmt
Note that there is no
-F 2 here to skip the headers. What happens is that the header is included in the dummy field for the first record. Recall that to find the first field of the first record, the bulk-load tools read all bytes until they find that terminating double quote. This is a typical example when the notion differs between the human mindset and the mindset of the bulk-load tools. And for once to your advantage.
If the first field is not quoted, it becomes a lot more difficult. We have this table:
CREATE TABLE skipheader3 (idcol int NOT NULL, quotetext varchar(40) NOT NULL, adate datetime NOT NULL)
And then we have the data file skipheader3.txt:
idcol,quotedtext,a-date 11,"Here is some text in quotes",2012-12-12 12,"And this, is text with a comma",2013-09-02
Had there been no header, we would have defined the first field to have the terminator
,". But now the header is there and with that terminator, the bulk-load tools will read the data for the first field in the first record to be idcol,quotedtext,a-data<CR><LF>11, and this will not end well. One way is to give up and load the data with the quotes into a staging table with a single column and then parse the string, which is not very fun at all.
There is a possible trick though: a line-break in Windows consists of two characters CR and LF, and what says that they must be part of the same terminator? We can say that the first field which we don't import ends with LF and that the last field ends with CR only. This is skipheader3.fmt:
9.0 4 1 SQLCHAR 0 0 "\n" 0 "" "" 2 SQLCHAR 0 0 ",\"" 1 idcol "" 3 SQLCHAR 0 0 "\"," 2 quotedtext Latin1_General_CI_AS 4 SQLCHAR 0 0 "\r" 3 a-date ""
For the first record, the first field is the header, and for remaining records the first field is an empty string between the CR the LF. A consequence of this solution is that the file will end with an incomplete record which only contains this first empty field. BCP does not fuzz about this, but loads the file successfully. However, BULK INSERT and OPENROWSET (BULK) produce the error message Bulk load: An unexpected end of file was encountered in the data file and they do not load any rows.
You can work around this by using the option LASTROW:
BULK INSERT skipheader3 FROM 'C:\temp\bulkload\skipheader3.txt' WITH (FORMATFILE = 'C:\temp\bulkload\skipheader3.fmt', LASTROW = 2)
This example assumes that you know beforehand how many records there are in the file and typically you don't. But you can determine this with a separate query over the file:
DECLARE @lastrow int SELECT @lastrow = COUNT(*) - 1 FROM OPENROWSET(BULK 'C:\temp\bulkload\skipheader3.txt', FORMATFILE = 'C:\temp\bulkload\onecol.fmt') AS T
Note that this uses a different format file, onecol.fmt:
9.0 1 1 SQLCHAR 0 0 "\r\n" 1 col1 "" ""
With this format file, the header line counts as a record like any other.
Since BULK INSERT does not accept variables for any parameter, you would need to use dynamic SQL to build the statement to use the value returned in @lastrow. However, it is common to have to use dynamic SQL for BULK INSERT anyway, since often the name of the data file is not known until run-time.
The BATCHSIZE option may seem to offer a second workaround, but this is an inferior choice. We will look at this option in a later section.
If the file is a Unix file with only LF for end-of-line, this trick is not possible. And unless there is some other lucky circumstance, you fall down to the next section, the impossible cases.
When you save a spreadsheet in CSV format, Excel only quotes fields when needed, and you may get a file like spreadsheet.txt:
1;This text does not include a comma;2012-08-09 2;But in this text, there is a comma;2013-02-28 3;"And in this text; there is a semicolon";2013-09-08 4;"Part of this text is ""quoted""";
(I created this file from Excel on my computer where semicolon is the CSV separator according to my regional settings.) If you are on SQL 2016 or earlier, there is no way you can import this file with the bulk-load tools, field by field. They have no notion of optional delimiters. You can import the file into a one-column table and then parse the lines, but parsing is better done outside SQL Server. As it happens, my article Using Table-Valued Parameters in SQL Sever and .NET, includes an example of how to load such CSV files from .NET. However, all this changed with the release of SQL Server v.Next CTP 1.1, see the chapter Built-in Support for CSV Files in SQL Server vNext.
No matter the SQL Server version, jagged files are dead in the water:
Mandatory1,Mandatory2,Mandatory3,Option1,Option2,Option3 M1,M2,M3,O1 M1,M2,M3,O1,O2
That is, whatever tool or person who produced this file did not add delimiters for the fields that were left out. What is obvious to a human is not obvious to the bulk-load tools and their binary mindset. When they have read O1 on line 2, they will look for data for the field Option2 and read until they find a comma. For the bulk-load tools, the record terminator is just the terminator for the last field. And there is nothing that says that this field has a terminator at all; as far as the bulk-load tools are concerned, it could just as well be fixed-length or have a prefix length. Again, you could opt to import the file into a one-column table and work from there, but there is all reason to consider alternative solutions before you ride too far on Kludge Avenue.
We will now explore how we can use format files when exporting data with BCP, which is the only bulk-load tool to permit that operation. The same sort of mapping applies when you export data. That is, the column number in the sixth property specifies that this column goes to the field described on this line with the one difference that the column number cannot be zero. The collation property is as important as ever for string data; if you leave it empty, text will be converted to your OEM character set, which is unlikely to be what you want.
If you want string data to be quoted in your export file, you can use a format file to apply the quotes. As long as the first field in the file is not to be quoted, the same format file can be used for both import and export. Thus, going back to the first example with formatdemo table, this works:
bcp tempdb..formatdemo out formatdemoout.txt -T -f formatdemo.fmt
If you compare formatdemoout.txt with formatdemo.txt, you will find that they are identical, save for the fact that the time portion has been added to the date, since the column is datetime in the table.
If the first column is a string column to be quoted, we need to do like we did on import: add a dummy field with a double quote as terminator. But this time it has to be added to the data we export and we can use
queryout option to add an extra NULL column to the result set. We also need a format file. You may recall that in the format file initialquote.fmt, we had a field which we did not import. For export we need a different file. This is initialquoteout.fmt:
9.0 5 1 SQLCHAR 0 0 "\"" 1 "" "" 2 SQLCHAR 0 0 "\"," 2 firstcol Latin1_General_CI_AS 3 SQLCHAR 0 0 ",\"" 3 intcol Latin1_General_CI_AS 4 SQLCHAR 0 0 "\",\"" 4 midcol Latin1_General_CI_AS 5 SQLCHAR 0 0 "\"\r\n" 5 lastcol Latin1_General_CI_AS
The only difference to initalquote.fmt is that the sixth property in the field description reads 1-2-3-4-5 and not 0-1-2-3-4. Putting this altogether, we get this command line:
bcp "SELECT NULL,* FROM tempdb..initialquote" queryout initialquoteout.txt -T -f initialquoteout.fmt
Note: For readability reasons I have broken up this command over two lines, but recall that in the command-line window a command must be on a single line.
Note: if you are using BCP from SQL 2012 or later, the above BCP command produces this message: Warning: BCP import with a format file will convert empty strings in delimited columns to NULL. I cannot really see the point in this warning, but just ignore it. Even if it first says error, it is a warning, and the command runs successfully.
If you compare initialquote.txt and initialquoteout.txt, you will find the they are binary identical.
In both these examples, the assumption is that the data should always be quoted. If you only want to quote when needed, you will need to handle this in your query, which is outside the scope of this article. All I can say is: good luck. Or more directly: avoid it if you can.
Every now and then on the forums I see people who ask how they can add column headers to the file they are creating with BCP. This usually results in a suggestion with using a UNION query, but this has all sorts of problems. If there are non-string columns in the data – and that is likely to be the case – you get a type clash, which you need to resolve by converting all data to strings in the query. If there is data to be quoted, but the headers are not, you will need to add the quotes in your query and not in the format file. On top of all, there is not really any guarantee that the header will come first in the file, unless you have an ORDER BY to mandate that, and getting a column to order by will require one more set of nested queries.
No, I am not going to show how to do the above, because there is a simpler solution, and you have already seen the outline of it. Put the headers in an initial dummy column which is NULL for all rows but the first. Again, this is a case where need to leave the mindset of "row in a table" = "line in the output file". But it should be said: although simpler than the UNION query, it's still a bit messy. Not the least due to the fact that since the BCP command must be on a single line, it does not get pretty.
As when skipping headers on import, there are two cases depending on whether the first line in the file is to be quoted or not. Let's look at exporting the table initialquote again, but this time with the same headers as we had on input. We don't need any new format file; initialquoteout.fmt serves us well. The only difference is that the query needs to crafted differently. For readability, here is the query in normal SQL formatting:
SELECT CASE row_number() OVER(ORDER BY(SELECT 1)) WHEN 1 THEN 'firstfield,num,midfield,lastfield' + char(13) + char(10) END, * FROM tempdb..initialquote ORDER BY 1 DESC
That is, we add a column where we use row_number and only if row_number is 1 we add the header line, which beside the field names line also includes CR-LF for the line break. Unless there is some requirement that the export file should be sorted, there is no reason to mandate a certain order for the row-numbering why we order by SELECT 1. (The ORDER BY clause is mandatory, so you need to use something like this to decline ordering.) What is important is the final ORDER BY clause. ORDER BY 1 DESC, means that we order by the first column in descending order. Since NULL sorts first in SQL Server, this means that the only row with a non-NULL value in this column will sort first when we use DESC. That is, the row with the header added to it. (You may find that you get the headers in the desired place without ORDER BY 1 DESC, but keep in mind that would be mere chance. Without an explicit ORDER BY, there are no guarantees.)
This is how the BCP command looks like, split up over several lines to fit into the browser window:
bcp "SELECT CASE row_number() OVER(ORDER BY(SELECT 1)) WHEN 1 THEN 'firstfield,num,midfield,lastfield' + char(13) + char(10) END, * FROM tempdb..initialquote ORDER BY 1 DESC" queryout initialquoteout2.txt -T -f initialquoteout.fmt
That is, to run this you need to make this one single line. The output is binary identical to initialquote2.txt.
If the first field is not to be quoted, the technique in the query is the same, but we need to make a change in the format file. This time we don't need any quote as terminator, so what do we specify? Answer: nothing. We go back to the table skipheader3, and for the export there is the format file skipheader3out.fmt:
9.0 4 1 SQLCHAR 0 0 "" 1 header Latin1_General_CI_AS 2 SQLCHAR 0 0 ",\"" 2 idcol "" 3 SQLCHAR 0 0 "\"," 3 quotedtext Latin1_General_CI_AS 4 SQLCHAR 0 0 "\r\n" 4 a-date ""
This file has quite a few differences from the original skipheader3.fmt. As you may recall, this was the table where we split up CR-LF, by having LF as the terminator for the first field. No such trickery here. The only thing which is peculiar to the terminators is that the first field does not have any. Or a length or a prefix length for that matter. As with initialquoteout.fmt, the column numbers in the sixth property changes to start on 1, since this is an export file. I have also added a column name for the header, but that is merely an aesthetic matter. More important is that I have added a collation for the header to avoid undesired character-set conversion. (The observant reader may note that this applies to initialquoteout.fmt as well when we use it to add headers. I cheated a little there.)
Here is the BCP command (again, recall that this needs to be a single line when you run it):
bcp "SELECT CASE row_number() OVER(ORDER BY(SELECT 1)) WHEN 1 THEN 'idcol,quotedtext,a-date' + char(13) + char(10) END,* FROM tempdb..skipheader3 ORDER BY 1 DESC" queryout skipheader3out.txt -T -f skipheader3out.fmt
This time the output file is different from the original data file, skipheader3.txt, but that is only because the adate column is a datetime column, so a time portion has been added.
Before I close this section, I should say that there is one situation when this technique falls flat: there is no data to export, but the requirement is that you should nevertheless create a file with a header. The best choice in that case may be to detect that the file or table is empty and then write a file with the headers using ECHO.
From what we have seen this far, it seems that for files with a fixed-length format, we need a format file. But that is not entirely true. In the example we looked at previously, some of the fields in the file are shorter the table columns, and in this case, the format file cannot be evaded. But if there is a perfect match, there is an alternative.
This was our table for testing fixed length:
CREATE TABLE fixedlength (x varchar(14) NOT NULL, y int NOT NULL, z varchar(20) NOT NULL, w char(1) NOT NULL)
And then we have a different data file, fixedlength4.txt:
First field 12345678Third field Y Second record 4711End of 2nd record N Third record 15The value is 15 Y
The difference to the previous file is that the second field, the one with the number, is now 12 positions wide, and the third field is now 20 positions wide. This file can be imported with this command:
bcp tempdb..fixedlength in fixedlength4.txt -c -t -T
Note that there is no argument after the
‑t option. Apparently this means "no terminator at all" and it implies fixed length. I say apparently, because I find no documentation on it, so we will have to consider it unsupported. I have not been able to work out anything corresponding with BULK INSERT or OPENROWSET(BULK).
Is this practically useful? For the example above, it may not really seem so. The int column is twelve positions in the file. Twelve? A 32-bit number can have at most ten digits in string format. Add to that a leading sign, so eleven is what I would expect. But for some reason it is twelve. Even if the target table only has string columns, it still seems fragile to assume a perfect match.
However, it's a little different when you are exporting data. Say that you need to produce a fixed-length file with BCP from some specification that has landed on your desk. You could use a format file, but then that would be one more file to maintain and keep track off. You may prefer to use the
queryout option with a query like:
SELECT convert(char(6), col1), convert(char(23), col2), ...
Not only does this save you from having a format file, but you also have all information about the widths the in same place as the query and the BCP command.
And before you ask, yes, you can also use
‑r without an argument. This means that also the last field in the file is of fixed length. And there is no line break at all in the file. (Save for the line breaks that may appear in the data.)
Note: I became aware of this possibility only as I was finalising this article thanks to a forum question from Fredrik Magnusson who was using this technique to produce a fixed-length export file.
Occasionally, you may want to export a BLOB, for instance a picture, to a file. That would be a single column in a single row. As an example, we have this table:
CREATE TABLE imgs (id int NOT NULL PRIMARY KEY, name nvarchar(40) NOT NULL, type char(3) NOT NULL, img varbinary(MAX) NOT NULL)
The file imgs.sql loads this table with three small pictures, here shown abbreviated. (This file is included in the bulkload.zip archive together with all the sample data and format files.)
INSERT arrows VALUES (1, 'right', 'gif', 0x4749463837611D0... INSERT arrows VALUES (2, 'left', 'gif', 0x4749463837611D0... INSERT arrows VALUES (3, 'up', 'gif', 0x4749463837610B0...
To export any of these pictures to disk, we need a format which does not have any terminator and nor length of any sort. This is img.fmt:
9.0 1 1 SQLBINARY 0 0 "" 1 col ""
To export the pictures, you can run:
bcp "SELECT img FROM imgs WHERE id = 1" queryout rgt.gif -T -f img.fmt -d tempdb bcp "SELECT img FROM imgs WHERE id = 2" queryout lft.gif -T -f img.fmt -d tempdb bcp "SELECT img FROM imgs WHERE id = 3" queryout up.gif -T -f img.fmt -d tempdb
If you open the images, you will see three small arrows point in the directions indicated.
Beware that you cannot use this format file to load a BLOB; BCP will spin forever loading 0x into millions and millions of rows. The proper way to load a single object with bulk-load is to use OPENROWSET (BULK) with the SINGLE_xLOB options as I described earlier.
In AD 2016, Santa Claus came early to all bulk-loaders out there. Microsoft released CTP 1.1 of SQL Server vNext, and one new feature was that BULK INSERT and OPENROWSET(BULK) now support CSV files out of the box. Recall the file spreadsheet.txt above:
1;This text does not include a comma;2012-08-09 2;But in this text, there is a comma;2013-02-28 3;"And in this text; there is a semicolon";2013-09-08 4;"Part of this text is ""quoted""";
To load this file into the table formatdemo, you can in SQL Server vNext use this command:
BULK INSERT formatdemo FROM 'C:\temp\spreadsheet.txt' WITH (DATAFILETYPE = 'char', FORMAT = 'CSV', FIELDTERMINATOR = ';')
FORMAT = 'CSV', that is where all the magic is. Here I had to specify FIELDTERMINATOR since my file has semicolon as separator. If you leave out FIELDTERMINATOR, the separator defaults to comma when you specify FORMAT = 'CSV'. If your file would happen to use a different quote character, there is an option FIELDQUOTE to override the default of the double quote.
You can still use format files when you use FORMAT = 'CSV', and they are still compulsory with OPENROWSET(BULK), as in this example:
SELECT * FROM OPENROWSET(BULK 'C:\temp\spreadsheet.txt', FORMAT = 'CSV', FORMATFILE = 'C:\temp\spreadsheet.fmt') AS x
This is how spreadsheet.fmt looks like:
9.0 3 1 SQLCHAR 0 0 ";" 1 idcol "" 2 SQLCHAR 0 0 ";" 2 textcol Latin1_General_CI_AS 3 SQLCHAR 0 0 "\r\n" 3 datecol ""
Note that no quotes appears among the field delimiters in the file.
When you use FORMAT = 'CSV' you lose some of the degrees of freedom with format files. All fields must have a terminator, and furthermore all fields but the last one must have the same terminator. (A fairly reasonable restriction one has to say.)
What about BCP? In CTP 1.1, there is no enhancement of this kind to BCP, but maybe it comes in a later build, who knows.
So far we have only talked about files with 8-bit data. Now it is the time to look at loading Unicode data.
Unicode data can be stored in different encodings. In the Western world, UTF‑8 is popular since it permits text in the Latin script to be stored with one byte for most characters, whereas it less apt for East-Asian languages where you typically need three bytes per character. In the UTF‑16 encoding, all characters in the Unicode base plane (which encompasses the characters from all living languages) take up two bytes. UTF‑16 comes in two flavours: big endian (UTF‑16BE) where "ABC" is stored as
0x004100420043 and little endian (UTF16-LE) where bytes in each 16-bit word are reversed, so that "ABC" is stored as
0x410042004300. The reason to use one or the other is largely related to the processor architecture. Windows uses UTF‑16LE as its normal Unicode representation. (Intel processors are little endian.)
To permit applications to know the encoding of a file, Unicode files often start with a byte-order mark (BOM) which is
0xFEFF for UTF16-BE,
0xFFEF for UTF16-BE and
0xEFBBBF for UTF‑8. Without a BOM, applications need to be told the encoding explicitly or use heuristics.
In this chapter we will work with this table:
CREATE TABLE unicodetest (a nvarchar(9) NOT NULL, b nvarchar(9) NOT NULL, c nvarchar(9) NOT NULL, d nvarchar(9) NOT NULL)
We will work with a number of data files that when presented correctly for a human look all like this:
中山|άλφα|Київ|Latin 中山|άλφα|Київ|Latin 中山|άλφα|Київ|Latin 中山|άλφα|Київ|Latin
In the zip archive, there are a number of data files with name starting in unicode followed by the encoding. If the name includes BOM, the file has a byte-order mark, else not. If you only see empty rectangles instead of the Chinese, Greek and Cyrillic characters, you may not have a font with full Unicode support. In that case, you can work with an alternate set of files with the name starting in collation, which contains the Nordic place names we saw earlier. Beware that not all files may display correctly in your text editor; particularly the BOM-less files constitute a challenge for them.
As far as Windows and SQL Server is concerned, UTF‑8 is just another multi-byte character set and not Unicode, and UTF‑8 files are handled as other 8-bit files. That is, you use the
‑c option with BCP and
DATAFILETYPE = 'char' with BULK INSERT and OPENROWSET(BULK). If you use a format file, you use SQLCHAR for the data type. The one thing that is different is that you leave the collation property as
"", since there is no collation based on UTF‑8. Instead you use the options
-C (with BCP) and CODEPAGE (with BULK INSERT and OPENROWSET(BULK)) to specify that the code page is 65001.
Support for loading UTF‑8 files was for a long time missing in SQL Server and it appeared first with the release of SQL 2016 and it was also backported to SQL 2014 SP2. However, you can still load UTF‑8 data to earlier versions of SQL Server, as long as you use BCP from SQL 2014 SP2 or later. If you attempt to use code page 65001 on SQL 2008 to SQL 2014 SP1 (or with BCP from any of these versions), you will find that this code page is explicitly blocked. The code page is accepted on SQL 2005, but the result may be incorrect as testified by this Connect item (and this item triggered Microsoft to block UTF-8, until they had a solution.)
Note: Beware that if you have several versions of the SQL Server tools installed on your machine, your oldest version of BCP is likely to come first in the PATH, since the SQL Server Setup adds folders to the end of PATH. You can inspect your BCP version with the command
BCP ‑v. It should report a version of 12.0.5000.0 or higher to be able to handle UTF-8 files.
Here are some example commands for loading UTF‑8 files, one with BOM and one without.
bcp tempdb..unicodetest in unicode-UTF8-BOM.txt -T -c -t "|" -C 65001 BULK INSERT unicodetest FROM 'C:\temp\bulkload\unicode-UTF8.txt' WITH (DATAFILETYPE = 'char', CODEPAGE = 65001, FIELDTERMINATOR='|')
You can inspect the result with this query:
SELECT datalength(a), * FROM unicodetest
datalength reports 4 for all rows, regardless if file has a BOM or not. That is, BCP and BULK INSERT are wise enough to not load the BOM.
If the terminator is outside the ASCII range, things become a little complicated. There is a test file UTF8-BOM-sect.txt which has the data above, but with the section character (§) is the field terminator. To get it to work, the terminator must be specified as a byte sequence. That is, you need to use the characters which in your ANSI code page corresponds to the bytes that represents § in UTF‑8. On my machine which uses code page 1252, the sequence is Â§. (The fact that the section character reappears in the sequence is a bit of coincidence.) Any of these commands work:
bcp tempdb..unicodetest in UTF8-BOM-sect.txt -c -t Â§ -C 65001 -T
bcp tempdb..unicodetest in UTF8-BOM-sect.txt -f UTF8-sect.fmt -C 65001 -T
The same rule about the terminator being a byte sequence applies to the format file UTF8-sect.fmt:
9.0 4 1 SQLCHAR 0 0 "Â§" 1 a "" 2 SQLCHAR 0 0 "Â§" 2 b "" 3 SQLCHAR 0 0 "Â§" 3 c "" 4 SQLCHAR 0 0 "\r\n" 4 d ""
If you try to load this file with BULK INSERT, with or without the format file, you are out of luck. Why it fails, I don't know, but I have reported a bug on Connect.
If you use BCP to export files, it seems that it never writes a BOM to the file, format file or not. If you would absolutely need the BOM, you would have to handle this yourself with a solution similar to the one we used when we added headers.
When loading UTF‑16 files without using a format file with BULK INSERT, you need to change the argument to DATAFILETYPE to be
widechar. There is no change in how you specify the field terminator. Of four test files in UTF‑16 encoding, two load successfully:
BULK INSERT unicodetest FROM 'C:\temp\bulkload\unicode-UTF16LE-BOM.txt' WITH (DATAFILETYPE = 'widechar', FIELDTERMINATOR = '|') BULK INSERT unicodetest FROM 'C:\temp\bulkload\unicode-UTF16BE-BOM.txt' WITH (DATAFILETYPE = 'widechar', FIELDTERMINATOR = '|') SELECT datalength(a), * FROM unicodetest
That is, BULK INSERT is able to handle both big-endian and little-endian files when there is BOM. And the SELECT shows that it understands to exclude the BOM from the data. If you attempt to load the files unicode-UTF16LE.txt and unicode-UTF16BE.txt, BULK INSERT takes exception of the missing BOM and tells you DataFileType was incorrectly specified as widechar. DataFileType will be assumed to be char because the data file does not have a Unicode signature and it goes downhill from there.
With BCP, you use the option
-w to specify that the file is a Unicode file. Just like BULK INSERT, BCP is able to load two the four test files – but not the same two:
bcp tempdb..unicodetest in unicode-UTF16LE-BOM.txt -T -w -t"|" bcp tempdb..unicodetest in unicode-UTF16LE.txt -T -w -t"|"
That is, BCP is able to load the little-endian files, no matter they have a BOM or not. If you run the SELECT above, you will see that BCP is able to strip out the BOM. If you attempt to load any of the big-endian files, you get the error Unexpected EOF encountered in BCP data-file. Presumably, this is because BCP is looking for the byte sequence 0x7C00, which does not appear in the file. (7C is the ASCII value in hex for the pipe character.)
When you use a format file for UTF‑16 data, the data type, the second property, needs to be SQLNCHAR. The terminators are tricky. The old-style format files are strictly 8-bit, and you need to enter the two bytes separately. That is, the field separator is |, but you also need to include the zero-byte which you do with
\0. And since the bulk-load tools read a binary stream, you need to specify them in the order of the byte stream, that is, in reverse order. This results in the following format file for our Unicode files, unicode.fmt:
9.0 4 1 SQLNCHAR 0 0 "|\0" 1 a "" 2 SQLNCHAR 0 0 "|\0" 2 b "" 3 SQLNCHAR 0 0 "|\0" 3 c "" 4 SQLNCHAR 0 0 "\r\0\n\0" 4 d ""
The collation property is blank for all columns. There is no need to specify the collation for a Unicode file, since there is no code-page issue here. The one exception is if you want to run collation-dependent queries on the file with OPENROWSET(BULK).
Note: it seems from my initial testing that when you use FORMAT = 'CSV' in CTP 1.1 of SQL Server vNext, you don't need to include the
\0 in the field delimiter.
With this format file, it is possible to load three of the files correctly with BULK INSERT: the two LE files as well as the BE file with the BOM. The latter is actually a little surprising, since the since the byte pattern
|\0 does not occur in this file. Presumably, BULK INSERT applies byte-swapping on BE files before it looks for terminators.
To load the BOM-less big-endian file, you can use a format file with the bytes in normal order. That is,
"\0|" instead of
"|\0" etc. (You can find this in unicode-UTF16BE.fmt). However, while the file unicode-16-BE.txt loads without errors the result is garbage, since the characters are stored with their bytes swapped. That is, you can only load big-endian files with BULK INSERT correctly if they have a BOM. (And this is not very remarkable at all. Since BULK INSERT binary to its soul, there is little reason to expect it to use heuristics to determine that it is looking at a big-endian file.)
When you use BCP with format files, the situation is bleaker. BCP still can't load big-endian files correctly, BOM or not. The file may load, but the result is garbled, since the bytes are stored as-is and not swapped. What worse is, when you have a format file, BCP does not handle the BOM correctly for little-endian files. Truncate the unicodetest table, and then run this:
bcp tempdb..unicodetest in unicode-UTF16LE-BOM.txt -T -f unicode.fmt bcp tempdb..unicodetest in unicode-UTF16LE.txt -T -f unicode.fmt
Then look at the data with the query we used above:
SELECT datalength(a), * FROM unicodetest
This is the result:
a b c d
----------- --------- --------- --------- ---------
6 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
4 中山 άλφα Київ Latin
That is, the value in column a in the first row is six bytes, not four. Those two extra bytes are the BOM. This means that you can only load UTF16-LE files with a BOM with BCP if you are in the lucky position to be able to skip those bytes. (Because you have a header and use
-F 2 to skip it, or the first field is quoted and you have an initial dummy field to skip that quote.) In a case like this when the first column is a character column, you could possibly strip out the BOM in a later step, but if the first column is supposed to be a number or a date, you are likely to get a conversion error.
Given that BCP handles BOM correctly for UTF‑8 files, I think this has to be construed as a bug, and I have submitted a report on Connect.
My curiosity had me to try terminators in the higher ranges. I tested two box-drawing characters, U+2503 and U+251D. I was able to load a file with the first of these by specifying the separator as
"\3%". (The ASCII code for the per-cent character is 37, that is, 25 in hex.) But I was not able to find a way to find a way to express the byte 0x1D so that BCP or BULK INSERT would understand it. (There are no files in the zip archive to display this, as this was a quick test that I ran.)
When it comes to exporting data, BCP behaves in the same way as on import. That is, it writes a BOM to the data file when you do not use a format file, but if you use a format file it does not write a BOM.
The file fixedlength-UTF16-BOM.txt is a Unicode version of the fixed-length file we looked at earlier. To load this file, we need a format file, and the one thing to keep in mind when loading fixed-length Unicode files is that the length should be specified in bytes, not in characters. This is fixedlength-UTF16.fmt:
9.0 5 1 SQLNCHAR 0 28 "" 1 a Latin1_General_CI_AS 2 SQLNCHAR 0 16 "" 2 b "" 3 SQLNCHAR 0 34 "" 3 c Latin1_General_CI_AS 4 SQLNCHAR 0 2 "" 4 d Latin1_General_CI_AS 5 SQLNCHAR 0 0 "\r\0\n\0" 0 "" ""
Using this format file, you can load fixedlength-UTF16.txt into the table fixedlength with BULK INSERT. However, if you use BCP, the first row is not loaded correctly, because of the BOM. The only time you can load a fixed-length file with a BOM correctly with BCP is when the field lengths matches your table for all columns, so that you can use
-t without an argument to do it without a format file. The file fixedlength4-UTF16-BOM.txt is such a file, and this works:
bcp tempdb..fixedlength in fixedlength4-UTF16-BOM.txt -T -w -t
Hopefully, you will never have to load fixed-length files encoded in UTF‑8. Or at least not files where the fields have a fixed length in number of characters, because as we have seen, the bulk-load tools count in bytes.
In this chapter we will look at a few more bulk-load options that affects the result of loading a file.
Here is something important. As reminder, this is formatdemo.txt:
11,"Here is some text in quotes","2012-12-12" 12,"And this, is text with a comma","2013-09-02"
Consider this table definition:
CREATE TABLE formatdemo (idcol int NOT NULL, textcol varchar(50) NOT NULL, datecol datetime NOT NULL CHECK (datecol > '20140101'))
As you can see, both dates violate the constraint. Nevertheless, both these commands succeed:
BULK INSERT formatdemo FROM 'C:\temp\bulkload\formatdemo.txt' WITH (FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt')
bcp tempdb..formatdemo in formatdemo.txt -T -f formatdemo.fmt
With OPENROWSET(BULK) it's different:
INSERT formatdemo (idcol, textcol, datecol) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\formatdemo.txt', FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt') AS T
Here you get a constraint violation.
This is the default behaviour, and you can change it. With BULK INSERT you use the option CHECK_CONSTRAINTS:
BULK INSERT formatdemo FROM 'C:\temp\bulkload\formatdemo.txt' WITH (FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt', CHECK_CONSTRAINTS)
With BCP you use the option
-h CHECK_CONSTRAINTS. And if you want to suppress constraint checking with OPENROWSET you can use the table hint IGNORE_CONSTRAINTS
INSERT formatdemo WITH (IGNORE_CONSTRAINTS) (idcol, textcol, datecol) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\formatdemo.txt', FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt') AS T
Note: This INSERT hint only works with OPENROWSET(BULK); you cannot use it when you insert from other sources.
The example is a CHECK constraint, but the same applies to foreign keys.
I quickly mentioned these options earlier, and I said that you always should include them with BCP and BULK INSERT. There are two reasons for this:
Skipping the constraint-check can give some performance gain with the bulk load, but from the second point follows that what you gain on the swings you can lose manifold on the roundabout.
If you suspect that you at some point got it wrong, you can review if you have any untrusted constraints this way:
SELECT object_name(parent_object_id) AS [Table], name AS [Constraint] -- FROM sys.check_constraints FROM sys.foreign_keys WHERE is_not_trusted = 1
To fix the issue, you can do
ALTER TABLE tbl WITH CHECK CHECK CONSTRAINT constraint_name
And, yes, there should be two CHECK. Once you have validated the constraints this way, the constraints are marked as trusted. (If there is data in the table that violate the constraint, the operation fails, not surprisingly.)
One situation where you may want to load tables without constraint checking is when you are loading data to a couple of tables and you don't want to be bothered by the dependency order. In this case you can load the data first and then check the constraints with ALTER TABLE when all files have been loaded.
The bulk-load tools have the same attitude against triggers. That is, by default BCP and BULK INSERT do not fire them, while OPENROWSET(BULK) does. To override this with BCP, you use the option
-h FIRE_TRIGGERS. With BULK INSERT, the option is FIRE_TRIGGERS and with OPENROWSET(BULK) you can use the table hint IGNORE_TRIGGERS. You should make a judicious choice whether to run with triggers or not. In difference to constraint checking, there is no cost that comes later if you run without them. However, keep in mind that triggers may perform important integrity checks or cascading updates that you need.
-h option to BCP is used for a number of so-called load hints. If you need more than one of them, include one
-h option and separate the arguments by comma:
-h "CHECK_CONSTRAINTS, FIRE_TRIGGERS"
By default, BCP and BULK INSERT ignore the values in the source file for a target column that has the IDENTITY property. This example illustrates:
CREATE TABLE formatdemo (idcol int NOT NULL IDENTITY, textcol varchar(50) NOT NULL, datecol datetime NOT NULL) BULK INSERT formatdemo FROM 'C:\temp\bulkload\formatdemo.txt' WITH (FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt') SELECT * FROM formatdemo
The output is:
idcol textcol datecol
----------- ------------------------------- -----------------------
1 Here is some text in quotes 2012-12-12 00:00:00.000
2 And this, is text with a comma 2013-09-02 00:00:00.000
You may recall that the values for idcol in the file are 11 and 12, not 1 and 2.
You can override this with the command-line option
-E for BCP. With BULK INSERT the option is KEEPIDENTITY:
BULK INSERT formatdemo FROM 'C:\temp\bulkload\formatdemo.txt' WITH (FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt', KEEPIDENTITY)
With INSERT SELECT FROM OPENROWSET(BULK) the behaviour is slightly different; as with other INSERT statements, you get an error if you explicitly attempt to insert values in the IDENTITY column. You can use the table hint KEEPIDENTITY hint to override. You must specify a column list for INSERT in this case:
INSERT formatdemo WITH (KEEPIDENTITY) (idcol, textcol, datecol) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\formatdemo.txt', FORMATFILE = 'C:\temp\bulkload\formatdemo.fmt') AS T
Again, this is a hint that is only available when you use OPENROWSET(BULK).
One problem with plain-text files is that there is not really a possibility to distinguish between NULL values and empty strings. In this section we will take a look at how the bulk-load tools handle this. We will work with this table:
CREATE TABLE empty (a int NULL, b int NULL DEFAULT 4711, c varchar(20) NULL, d varchar(20) NULL DEFAULT 'Empty', e char(29) NULL, f datetime NULL, g datetime2(0) NULL)
And we have the data file empty.txt:
;;;;;;! ; ; ; ; ; ; ! ; ; ; ; ; ; ! ; ; ; ; ; ; !
The field terminator here is semicolon, and each record has a number of spaces in the field. The record terminator is !-CR-LF. The sole purpose of the exclamation mark is to make the spaces in the last field visible. We load this file with BCP:
bcp tempdb..empty in empty.txt -T -t; -r"!\n" -c
To look at the table, we use this query:
SELECT *, datalength(c), datalength(d), datalength(e) FROM empty
we see this:
a b c d e f g
------ ------ ------ ------- ------ ------ ------ ----- ----- -----
NULL 4711 NULL Empty NULL NULL NULL NULL 5 NULL
NULL 4711 NULL NULL 1 1 29
NULL 4711 NULL NULL 2 2 29
NULL 4711 NULL NULL 3 3 29
What does all this mean? Let's first look at the non-string columns, that is, a, b, f and g. For such columns an empty field or a field consisting of spaces alone is always interpreted as NULL. But if the column has a default, the default value is inserted in place of NULL. If we now look at the string columns c, d and e, we see that the result is different depending on whether the field is entirely empty or includes a number of spaces. Only if the field is empty, the result is NULL or any default value. If there are actual spaces in the field, these are taken at face value.
Let's also test how BULK INSERT works:
BULK INSERT empty FROM 'C:\temp\bulkload\empty.txt' WITH (DATAFILETYPE = 'char', FIELDTERMINATOR = ';', ROWTERMINATOR = '!\n')
The result is almost the same, but there is a difference for column g. Here the spaces are inserted as 1900-01-01 00:00:00. This is consistent with how space is converted to date/time in general, but it is inconsistent with how space is handled by bulk-load in general. As for other data types, I have not tested all, and if there is one that is particularly important to you, I would recommend that you test before you go too far.
Turning to OPENROWSET, there is yet a difference. For OPENROWSET we need the format file empty.fmt:
9.0 7 1 SQLCHAR 0 0 ";" 1 a "" 2 SQLCHAR 0 0 ";" 2 b "" 3 SQLCHAR 0 0 ";" 3 c "" 4 SQLCHAR 0 0 ";" 4 d "" 5 SQLCHAR 0 0 ";" 5 e "" 6 SQLCHAR 0 0 ";" 6 f "" 7 SQLCHAR 0 0 "!\r\n" 7 g ""
We run the command:
INSERT empty SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\empty.txt', FORMATFILE = 'C:\temp\bulkload\empty.fmt') AS T
In difference to BCP and BULK INSERT, the default values for columns b and d are not inserted, but we get NULL here.
What you have seen here is only the default behaviour. You can instruct BCP to keep NULL values in the input and ignore the default value for the target columns with the option
-k. The same option with BULK INSERT is KEEPNULLS. And for INSERT there is an OPENROWSET(BULK)-specific hint, KEEPDEFAULTS to get the default behaviour of BCP and BULK INSERT.
There is one thing you cannot do with BCP or BULK INSERT: have multiple spaces for a string column to be interpreted as NULL. This can be problematic, if you want a blank field in a fixed-length file to result in NULL. With OPENROWSET(BULK) you can work round this issue, by using
nullif(col, '') in your SELECT list.
So far about importing data. But what happens when export data? First load the table with data:
TRUNCATE TABLE empty INSERT empty VALUES(NULL, 19, NULL, '', '', NULL, NULL)
Then export the data with:
bcp tempdb..empty out emptyout.txt -T -c -t; -r"!\n"
A casual look at the file in the command-line window looks like this:
;19;; ; ;;!
It seems here that the empty string for column d in the first row was exported as a single space. But if you import the file again, you will find that d has been resurrected as an empty string! If you look at the file in a hex editor, you will see that the character in emptyout.txt between the third and fourth semicolons is not a space, but a NUL character (i.e. ASCII 0). I said in the beginning of this section, in a text format there is no way to distinguish between empty strings and NULL values. What the bulk-load tools have achieved with this little trick is that they permit this distinction at the sacrifice of the distinction between an empty string and a single NUL character – a value not encountered very often in a text file.
In this article, I have tacitly assumed that once you have determined the correct format file, the load will succeed without problems. But not surprisingly, errors can occur with bulk-load.
If the format specification is that wrong so that the tool cannot find the expected terminator at the end of the file, this will cause the entire load to fail, and no rows will be loaded (unless you have set a batch size, discussed in the next session). Another scenario is that the format specification is incorrect, but by some lucky circumstance the bulk-load tool is able to identify what it assumes to be complete records. As you may recall, this was exactly what happened in the very first example we looked at in this article. In many cases, however, this situation results in a flood of error messages, because the load tool will try to cram string values into integer columns and the like. With some luck, nothing is loaded, but you could also find yourself with a table of garbage.
Then there is the third situation, the format specification agrees with the file, but the file has occasional bad data, for instance an incorrect date such as 2014-02-30. The tools are somewhat liberal, and by default they accept ten such errors before they call it a day. This may or may not be what you want, and there are two options for error handling.
One is an option to permit you to change the limit how many errors you accept. For instance, to say that you don't accept any error at all, you would use
-m 0 with BCP and
MAXERRORS = 0 with BCP and OPENROWSET(BULK).
You can also request that errors should be written to a separate file. With BCP you use the option
-e. BCP will not fuzz if the file exists, but simply overwrite what is there. With BULK INSERT and OPENROWSET(BULK) the option is ERRORFILE, and this option creates two files, one with the data itself and one with detailed information. The first of these files has the name you specified with the ERRORFILE option, the second file has Error.Txt tacked on. BULK INSERT and OPENROWSET(BULK) consider it an error if any of these files already exist, which certainly is irritating.
If you want to use TRY-CATCH with BULK INSERT or OPENROWSET(BULK), there are a couple of gotchas, that I discuss in a chapter in Part Two of my series Error and Transaction Handling in SQL Server.
By default, the bulk-load operation is a single transaction. When loading big files, this can cause quite a strain on the transaction log. For this reason, the bulk-load tool offer the possibility to commit rows regularly, so that the log can be truncated (which requires that you are in simple recovery or back up the transaction log while the operation is running). There are no less than three ways to do this:
-bwith BCP and BATCHSIZE with BULK INSERT. It does not appear to be available with OPENROWSET (BULK).
-h "KILOBYTES_PER_BATCH=nn"with BCP and KILOBYTES_PER_BATCH with BULK INSERT.
-h "ROWS_PER_BATCH=nn"with BCP and ROWS_PER_BATCH with BULK INSERT and OPENROWSET(BULK).
When it comes to the last one, I will have to admit that I foggy on the difference between this option and BATCHSIZE. Actually, when reading Books Online, it seems that the option is an approximate indication of the number of rows in the file and it serves as a hint to the optimizer for memory allocation. I have not tested or used this option myself.
When you set a batch size, and there later is an error which causes the load tool to give up, the rows that have been committed will obviously remain in the table. If you have a file with some anomaly at the end, for instance a deviating footer, you could use BATCHSIZE=1 to load the file. However, beware that if you do this, you need to make sure that you do not qualify for a minimally logged operation, because in this case each row will be written to its own 64 KB extent, and you will waste a lot of space. Generally, it is better to read the file with OPENROWSET(BULK) and count lines and then use LASTROW to eliminate the footer, as we saw earlier.
Two other performance-related options are ORDER and TABLOCK with BULK INSERT; with BCP they are
-h options. ORDER indicates that the file is sorted according to the order you specify, which improves speed when loading into a clustered index, as SQL Server does not have to sort the data. TABLOCK takes a table lock on the table, and this is one of the requirements for a minimally logged operation.
I recommend that if you plan to use any of these options, or you want to make sure that your bulk-load is minimally logged that you read the relevant topics in Books Online, as this strays outside the scope for this article.
There are a few options that are specific to BCP. Some of them are related to the login process, for instance
-K to specify the application intent when you connect to an availability group. I will not detail all here; you can review the available options with
bcp -?. Here I will only mention two:
By default, BCP connects with the setting QUOTED_IDENTIFIER OFF. This will result in an error if you target table has a filtered index, an index on a computed column or is part of an indexed view, since operations on such indexes require QUOTED_IDENTIFIER to be ON. You can use the option
-q to override the default. In fact, there is no menace to always have this option as part of your BCP command. The likelihood that your BCP command actually would require QUOTED_IDENTIFIER to be OFF is ridiculously small. (QUOTED_IDENTIFIER OFF is a pure legacy option, but for some reason several of the SQL Server command line tools as well as SQL Server Agent default to having this setting OFF.)
By default, BCP ignores the regional settings on your computer, which has to be said to be a good thing, because this gives a uniform behaviour from computer to computer. But if you have a file where dates are formatted according to your regional settings, use the option
To run a bulk-load operation, you need SELECT and INSERT permission on the target table. BULK INSERT or OPENROWSET (BULK) also require permission on server level, ADMINISTER BULK OPERATIONS. This permission can also be obtained by being member of the fixed server role bulkadmin.
To insert explicit values in IDENTITY columns or to skip constraint checking and firing triggers, you also need ALTER permission on the target table. Recall that skipping constraint checking and triggers is the default behaviour with BCP and BULK INSERT.
If you want to use BULK INSERT or OPENROWSET(BULK) in a stored procedure that is to be executed by plain users without elevated permission, you can arrange this by signing the procedure with a certificate and from that certificate create a login which has the required permissions. Please see my article Granting Permissions through Stored Procedures, where I discuss this technique in detail. You also find an example for BULK INSERT in this article.
Permissions inside SQL Server is not all. You also need permission to read the data file. With BCP this should not be much of a problem as it is like any other file permission. When you use BULK INSERT, things are a little more complicated, though. If you log in with Windows authentication, SQL Server will impersonate your login to access the file. That is, it is your permissions to the file that matters. However, if you log in on computer A, SQL Server is on computer B and the file is on a share on computer C, this may fail for a number of reasons. SPN for Kerberos may not be set up correctly, or the SQL Server machine or the service account is not trusted for delegation just to mention a few possibilities. You will probably have to work with your Windows administrator to resolve such problems.
If you log in with SQL Server authentication, SQL Server will access the file as itself, that is, the permission of service account for SQL Server will apply. Thus, if you get a permission error in this case, you need to check is the permissions for this account. If you don't know the service account, you find this in the SQL Server Configuration Manger. Keep in mind that if the file is on a different computer, and the service account is an account local to the machine, for instance NT SERVICE\MSSQLSERVER, what counts are the permissions for the machine account, that is, DOMAIN\MACHINE$.
Also, keep in mind that if you get error 5 = Access is denied, this does not have to be a permission error. It can also indicate that another process has the file locked.
XML format files were added in SQL 2005. Personally, I find them overly verbose, but admittedly, there are some things that are cleaner in the XML format and they permit you to specify some properties that are not possible to express with old-style format files. However, they are of fairly marginal benefit. At the same time, there is one thing that you easily can do with the old-style format files that you cannot do with XML files: skip a column in the target table.
We have several times looked at the data file formatdemo.txt:
11,"Here is some text in quotes","2012-12-12" 12,"And this, is text with a comma","2013-09-02"
Here is forrmatdemo.xml which is the XML version of formatdemo.fmt:
<?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="One" xsi:type="CharTerm" TERMINATOR=",""/> <FIELD ID="Two" xsi:type="CharTerm" TERMINATOR="","" COLLATION="Latin1_General_CI_AS"/> <FIELD ID="Three" xsi:type="CharTerm" TERMINATOR=""\r\n"/> </RECORD> <ROW> <COLUMN SOURCE="One" NAME="idcol"/> <COLUMN SOURCE="Two" NAME="textcol"/> <COLUMN SOURCE="Three" NAME="datecol"/> </ROW> </BCPFORMAT>
The outermost BCPFORMAT tag is just a standard blurb that is the same in all XML format files, but the tag must always look this way. That is, the namespace declarations must be there. (The leading
<?xml> tag is not required, though.) I assume that the first namespace declaration serves to state the version of the format-file format, but there have been no changes since the introduction in SQL 2005. In difference to old-style format files, you do not declare the number of fields in a record anywhere.
Then there are two second-level elements, RECORD and ROW. RECORD has a set of FIELD elements, while ROW has a set of COLUMN elements. This is certainly a clean division that uses the correct terminology. RECORD and FIELD describes the contents in the data file, while ROW and COLUMN describes the resulting table of the bulk-load. That is, either the target table or the columns generated when you use OPENROWSET (BULK).
The FIELD elements have a number of attributes, but there is no attribute that corresponds to the first property of the old-style format file. Instead the field order is simply given by order of the FIELD elements in the format file. On the other hand, there is a mandatory attribute ID which serves as mapping between FIELD and COLUMN elements. In practice, you will use 1, 2, 3, 4... as the values for this attribute, but you are free to use whatever you like, as testified by the example above.
A mandatory attribute in FIELD elements is xsi:type, which can have any of these eight values: CharTerm, NCharTerm, CharFixed, NCharFixed, NativeFixed, NativePrefix, CharPrefix and NCharPrefix. Of these, you would use the first four for text files, while the last four apply to binary data files only. Char means that the data is 8-bit, while NChar is for Unicode (UTF‑16) data. Fixed means the data is fixed length, while Term means that the field has a terminator.
In this example, we have a file with terminators, and we must use the TERMINTATOR attribute to specify the terminator. If you thought it was a little messy to say
"\",\"" in old-style format files, it does not get any better here, but you need to use XML-entities and say
"","". (And if you mistakenly use
"e; instead, the error message may not be that helpful.)
If you have a Unicode data file, you may think that you can save the XML format file as UTF‑8 or UTF‑16 and then you can specify the terminator as a single character to save you from this funny
|\0 dance we saw earlier. Oh no, you still need to specify the terminator byte by byte.
As the example suggests, you can use the COLLATION attribute to specify the collation. As with old-style format files, you should specify the collation for character data in 8-bit files, unless you use the
-C or CODEPAGE options to override the default of the OEM character set.
You can also specify MAX_LENGTH. This has the same effect as specify the fourth property (the length) in an old-style format file when you also specify a terminator (the fifth property). That is, BCP flatly ignores it, while BULK INSERT and OPENROWSET(BULK) raise an error if the length is exceeded. Before we take a closer look at the COLUMN element, let's take a quick look at fixedlength2.xml that corresponds to fixedlength2.fmt that we look at earlier:
<?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="1" xsi:type="CharFixed" LENGTH="14"/> <FIELD ID="2" xsi:type="CharFixed" LENGTH="8"/> <FIELD ID="3" xsi:type="CharFixed" LENGTH="17"/> <FIELD ID="4" xsi:type="CharFixed" LENGTH="1"/> <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\r\n"/> </RECORD> <ROW> <COLUMN SOURCE="1" NAME="x" /> <COLUMN SOURCE="2" NAME="y" /> <COLUMN SOURCE="3" NAME="z" /> <COLUMN SOURCE="4" NAME="w" /> </ROW> </BCPFORMAT>
When you have a fixed-length field, you use LENGTH to specify the length of the field. That is, the name is different from MAX_LENGTH, and Books Online suggests that they serve different purposes. However, from what I can tell, they are entirely synonymous.
You may also note here that in this format file, I simply use consecutive numbers for the ID in the FIELD element. Despite my initial example, I would recommend that you adhere to this pattern.
In the COLUMN element, you list the columns in the order they appear in the table or in the SELECT list for OPENROWSET(BULK). The SOURCE attribute maps back to the ID attribute of the FIELD element, to permit the fields in the file to come in different order than the columns in the table. You can easily skip a fields, as testified by field 5 above not being mapped to a column. (This is, as you may recall, an empty field the represents the end of line.)
On the other hand, if you want to skip a column in the target table, this is more difficult, unless all columns you want to skip come after those you want to load. But if you want to skip for instance the IDENTITY column in the table, and this column comes first, you cannot do this with BCP and BULK INSERT. (With OPENROWSET(BULK), you can always list the columns you want to insert to explicitly in the INSERT statement.) As a workaround, Books Online suggests to define a view and bulk-load into that view. Quite lame, if you ask me.
At the same time, this perfectly trivial with an old-style format file as you only list the columns you want to load.
In the COLUMN element, you can also specify a NAME attribute, and this attribute has the same effect as the seventh property in old-style format files. That is, the name is informational for BCP and BULK INSERT, but it serves to give the column name for the result set from OPENROWSET(BULK).
There are a few more attributes you can specify with COLUMN, and these attributes represent functionality which is unique to XML format files. They permit you to specify the type of the column returned, and this is mainly useful with OPENROWSET(BULK). The main attribute is xsi:type. The permitted values are entirely different from the same attribute in the FIELD element. As there are quite many of them, I don't include the full set here, but instead I refer you to the topic XML Format Files in Books Online. Depending on the type you specify, you can also specify the attributes LENGTH, PRECISION and SCALE. For all types you can specify the attribute NULLABLE with the values YES or NO.
Here is an example how you can make use of being able to specify the data type for the column. Consider the data file datatypes.txt:
3;3.14159E+0;Jeden 4;31.4159E-1;Dwa 23;314.159E-2;Trzy 102;3141.59E-3;Cztery 46;;As a starting point, we have this format file datatypes0.xml which does not use these extra attributes for COLUMN:
<?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=";"/> <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";"/> <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n"/> </RECORD> <ROW> <COLUMN SOURCE="1" NAME="a"/> <COLUMN SOURCE="2" NAME="b"/> <COLUMN SOURCE="3" NAME="c"/> </ROW> </BCPFORMAT>
We run this statement:
SELECT MIN(a), COUNT(c) --, AVG(b) FROM OPENROWSET(BULK 'C:\temp\bulkload\datatypes.txt', FORMATFILE = 'C:\temp\bulkload\datatypes0.xml') AS t
The call to AVG is commented out, because it yields an error – one cannot use AVG on character data. (Since as long we don't use xsi:type, the column has the same type as the field in the file, that is varchar.) The output of
MIN(a) is 102, which at first may seem like SQL Server has confused MIN and MAX. But SQL Server sees the data as character data, and when sorting data as strings, all strings starting with 1 comes before those starting with 2. Note also that
COUNT(c) returns 4. Recall that COUNT over a column name ignores NULL values in that column. That is, for that last line the value of c is NULL, which is consistent with what we saw earlier.
We now modify the COLUMN entries to include data type and nullability:
<COLUMN SOURCE="1" NAME="a" xsi:type="SQLINT" NULLABLE="NO"/> <COLUMN SOURCE="2" NAME="b" xsi:type="SQLFLT8" NULLABLE="YES"/> <COLUMN SOURCE="3" NAME="c" xsi:type="SQLVARYCHAR" NULLABLE="NO"/>
This is in datatypes.xml, and if we run
SELECT MIN(a), COUNT(c), AVG(b) FROM OPENROWSET(BULK 'C:\temp\bulkload\datatypes.txt', FORMATFILE = 'C:\temp\bulkload\datatypes.xml') AS t
We now get 3, 5 and 3.14159 respectively. That is, the two numeric columns are now handled as such. And the query now finds 5 non-null values in column c. Since it is declared as not nullable, the empty string is interpreted as-is and not as NULL. In the last record in file, the second field is also empty. In the format file above, b is declared as nullable. If you change the format file to say NULLABLE="NO" also for this column, the bulk-load will fail with the error The bulk load failed. Unexpected NULL value in data file row 5, column 2. The destination column (b) is defined as NOT NULL. That is, for data types where there is no reasonable conversion of an empty string, the nullability in the format file serves as an assertion.
There is obviously some marginal benefit with being able to specify the data type and nullability. But it is not often that you run queries on the data file, is it? And rather than using a format file, it may be easier to simply use convert in the query.
I mentioned in the beginning that OPENROWSET(BULK) can work differently depending how you use it. Here is an example to show this:
CREATE TABLE datatypes(ident int IDENTITY, a int NULL, b decimal(6,5) NULL, c varchar(20) NULL) go INSERT datatypes(a, b, c) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\datatypes.txt', FORMATFILE = 'C:\temp\bulkload\datatypes.xml') AS t go INSERT datatypes(a, b, c) SELECT a, b, c FROM OPENROWSET(BULK 'C:\temp\bulkload\datatypes.txt', FORMATFILE = 'C:\temp\bulkload\datatypes.xml') AS t go INSERT datatypes(a, b, c) SELECT * FROM OPENROWSET(BULK 'C:\temp\bulkload\datatypes.txt', FORMATFILE = 'C:\temp\bulkload\datatypes.xml') AS t WHERE 1 = 1 go SELECT ident, datalength(c) FROM datatypes WHERE a = 46 go DROP TABLE datatypes
The output from this script is:
That is, if you do INSERT ... SELECT * FROM OPENROWSET(BULK) without a WHERE clause, the setting of the NULLABLE is ignored. But if you have an explicit column list or add the simplest of WHERE clauses, the NULLABLE attribute comes into play.
This is not really all. When devising the script above, I initially had the column a as NOT NULL. This resulted in the error The bulk load failed. Unexpected NULL value in data file row 5, column 2. The destination column (a) is defined as NOT NULL for the first of the INSERT statements above. This is very confusing, since column 2 from the file should go to column b in the table as per the INSERT statement – and which it eventually does.
Leaving OPENROWSET(BULK) behind, let's briefly look at the other two. Books Online suggests a situation when specifying the type is useful also with BULK INSERT. Consider the table above, without the IDENTITY column:
CREATE TABLE datatypes(a int NOT NULL, b decimal(6,5) NULL, c varchar(20) NULL)
If we attempt to load datatypes.txt into this table with BULK INSERT using the format file datatypes0.xml, this fails because strings with scientific notation cannot be converted to decimal. If we instead use datatypes.xml, BULK INSERT succeeds, because the values are first converted to float and then to decimal. If you think this example is far-fetched, I am in no way in disagreement with you.
When it comes to BCP, Books Online says that BCP ignores the ROW element, which reasonably cannot be correct. Surely, BCP respects the field-to-column mapping. But it seems that what Books Online wants to say is that BCP ignores xsi:type and the related attributes in the ROW element. In any case, BCP is able to load datatypes.txt into the datatypes table with either format file, or for that matter no format file at all. Apparently it knows how to convert a string with an exponent to a decimal value.
In this article we have looked at how the bulk-load tools work. It has not been a complete survey. The focus has been on functionality, and we have only looked briefly on security, error handling and performance. There are two key take-aways from this article. The first is that the bulk-load tools have a binary mindset. You may look at a text file and think that is the first line, that's the second line etc. But these tools do not think in lines. They read one field at a time, consuming a binary data stream. They don't assign any particular meaning to line endings, unless it has been specified as the terminator for the field they are currently reading. To come on terms with the bulk-load tools, you need to embrace their mindset.
The second take-away is that the bulk-load tools are not universal tools for loading files. We have looked at format files and how we can use them to load certain file formats, not the least files with quoted text. As long as your file format can be described by a format file, the bulk-load tools can serve you well. But we have also learnt that there are file formats that are outside the realm that the bulk-load tools are designed for. That is, you must always be prepared to switch something else when your file does not fit into their world.
Back to my home page.