All You Wanted to Know about Collations

An SQL text by Erland Sommarskog, SQL Server MVP. Most recent update 2025-02-26.
Copyright applies to this text. See here for font conventions used in this article.

Introduction

Collations are one of those things that tend to confuse SQL Server users. And it is by no means only the true beginners I have in mind. No, I have encountered people whom I count as experts being confounded by collation behaviour. It is not that the collations as such are confusing. No, what is bewildering is the concept of human language. Doesn't everyone speak English? No. But at least all languages have the same rules as English, don't they? No, they do not. Different languages do it differently, and that is very much the theme of this article. (Lest some readers with English as their native language feel offended and think I'm taking a jab at them, I should point out that this attitude is very prevalent among my fellow Swedish countrymen as well. And probably among a few more groups in the IT community.)

The reader may be frightened by the sheer length of this article, and, true, it is not short. However, far from all of the content is hard-core technical stuff. No, there is a generous dose of examples from different languages which I hope you will find entertaining and enlightening, but it is not hard knowledge you need to know for your daily chores with SQL Server (unless you actually need to support any of these languages!). The article also covers more exotic aspects of collations and Unicode that you may never encounter in your professional life, but you may still appreciate getting an orientation about these topics. Intermingled with this lighter material there are some tips and lessons you are very likely to have use for.

The reader may ask what the prerequisites are for this article. I would say that foremost you need an open mind and an understanding of the concept that different languages do it differently. SQL-wise, there are no particular prerequisites, although the article is not really aimed at the true beginners. Consider it to be a level-200 article.

The article includes examples from a number of languages, but still more languages are left out entirely. Most examples are from languages that I have some knowledge of myself, and European languages are grossly overrepresented. There are some examples in Japanese and occasional examples from other Asian languages, but I'm completely silent on Arabic and other languages that are written right-to-left. I can only ask for forgiveness for this omission, but I am handicapped by my total lack of experience of these languages.

The article starts off with an introductory chapter that discusses what a collation controls and where you can define the collation in SQL Server. It also deals with one of the pain points with collations that many have encountered: collation conflicts. The next chapter gives an historic background, not only about the collations in SQL Server, but also about Unicode, character sets and code pages – topics I find that many experienced computer professionals have a foggy notion of. This is followed by a long chapter where I dissect collation names into components and explain what a name like Latin1_General_100_CI_AS_KS_WS_SC_UTF8 really means. Then there is a chapter of more normal length about SQL collations, a group of legacy collations from the early days of SQL Server. A short chapter covers two types of Unicode modifiers, combining accents and variation selectors.

The next chapter is directed towards ISVs whose software may be installed under different collations depending on the local language at the customer site. This chapter gives some recommendations and discusses collation-related features that can be useful in this context. Chapter eight lumps together a couple of topics that are too small to warrant chapters of their own. This includes a discussion of what the default collation is when you install SQL Server, how collations work with linked servers, a little trick to quickly ASCII-ify a string, and how confusing it can be if you have a T‑SQL batch that has a USE statement in the middle of it.

The penultimate chapter is about how collations affect performance. It starts with two hard lessons about situations where the collation has a huge impact. The rest of this chapter is devoted to data from performance tests I have conducted. While this may sound intriguing to some, I should warn you already at this point that this is not an essential part of the article. (Also, it may stray beyond being level 200 SQL-wise.) In the last chapter, I try to tie everything together by making recommendations for strategies for collations – including some ideas you may find to be on the wild side.

One of the most daunting tasks you can undertake in SQL Server is to change the server and/or database collation. I don't cover this topic in this article, but instead there is an accompanying article, Changing the Server and Database Collation in SQL Server, which comes with two scripts that can help you with this difficult and laborious process. I decided to make this a separate article, since there are likely to be readers who need more or less urgent help with implementing such a change, but who have no interest in wading through all the material in this long article.

Table of Contents

Introduction

Collation Basics

What Does a Collation Control?

Where to Define the Collation?

Collation Conflicts

A History of Character Sets, Code Pages and Collations

In the Beginning There Was ASCII...

The Competing 8-bit Character Sets

Enter Unicode

Microsoft and Their Code Pages and SQL Server

The History of Collations in SQL Server

The Anatomy of a Collation Name

Collation Families

The Version Number

CI/CS – Case (In)Sensitivity

AI/AS – Accent (In)Sensitivity

KS – Kana Sensitivity

WS – Width Sensitivity

VSS – Variation-Selector Sensitivity

BIN/BIN2 – Binary Collations

SC – Surrogate Compatible

UTF8 Collations

Collation Families and Windows Locales

SQL Collations

General Observations about SQL Collations

How SQL Collations are Different

Uppercase Preference and Tertiary Weights

Compatibility Collations

Some Unicode Modifier Characters

Combining Diacritical Marks

Variation Selectors

Collations Considerations for ISVs

Azure SQL Database

Mandate the Collation

Develop Case-sensitive, Test Case-insensitive

Partially Contained Databases

The CATALOG_COLLATION Option

Various Minor Topics

Selecting the Server Collation at Setup

Linked Servers and Collations

A Trick with Fallback Characters

Multi-Database Scripts

How the Collation Affects Performance

The Data-Type Mismatch Accident

LIKE '%abc%'

Performance Tests – Introduction and Setup

Is Case-Sensitive Faster?

The Impact of the Collation Version

The Difference Between varchar and nvarchar

Binary Collations

Performance of UTF‑8 Collations

Comparing Performance of Collation Families

Executive Summary

Design Considerations and Final Thoughts

One or Multiple Collations for Your Organisation?

Which Collation Should I Choose?

The Choice of Data Types – varchar or nvarchar?

Collation for Codes

Conclusion, Acknowledgements and Feedback

Revision History

Technical Note: This article includes many characters that you do not see that often. A challenge for me is that not all characters may render correctly in everyone's browser. These characters mainly occur in code samples for demos. For this reason, I provide code samples which include characters with potential display problems in three ways:

For demos only using characters that are in Latin‑1, I typically let it suffice with the text inline, assuming that all characters will display correctly for everyone. Please let me know if you have problems with seeing some characters so that I can amend the article.

For demos that include characters beyond Latin‑1, I accompany the text with an image of the code, so that you can see what the characters are intended to look like, even if they do not render correctly in your browser.

Finally, the demo scripts are also available in files in the archive collationscripts.zip, so that you can copy and run the code without worrying about characters being mangled along the way.

The demos create many databases. With a very few exceptions, the name of a database is always the same as the database collation. If you at some point want to get rid of these databases, there is a script drop_demo_databases.sql in the above-mentioned archive collationscripts.zip.

In the text you will see words and characters in this style. They are examples of words or characters I discuss in the article.

Note on Article Scope: Microsoft has a wide range of offerings in the data space these days, and there is a lot more than just SQL Server. Just about all of these products have some form of collation concept. I don't have experience of all of these products myself, and trying to cover them all would make the text unwieldy. So this article is confined to on-prem SQL Server and the cloud offerings that run the same engine, that is Azure SQL Managed Instance and Azure SQL Database. But I am not discussing Azure Synapse Analytics, Fabric Data Warehouse, etc. I am also silent on Fabric SQL Database, even though it's kind of a little brother of Azure SQL Database, since it is in preview at the time of this writing.

Collation Basics

What Does a Collation Control?

If you want to view what collations there are available, you can run this query:

SELECT * FROM sys.fn_helpcollations()

Exactly how many rows this function returns depends on which version of SQL Server you are running. If you run the query on Azure SQL Database or Azure SQL Managed Instance, you will get back no less than 5540 collations, which is somewhat more than SQL Server 2022, which has 5508 collations. As for older versions, I will come back to that later in the section The History of Collations in SQL Server. The reason for this frightening number is that a collation is a combination of various properties, where just about any property can be combined with any other, leading to the same effect as a Cartesian join: an explosion of combinations. What these properties are is something we will return to in the chapter The Anatomy of a Collation Name.
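If you just want the number, you can of course let SQL Server count for you:

SELECT COUNT(*) AS no_of_collations FROM sys.fn_helpcollations()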

But let's first get an understanding of what a collation is: A collation is a set of rules for how to handle string data depending on human language. What does this mean more exactly? Foremost, these are the rules that apply when comparing data. For instance, consider these Boolean expressions:

'I' = 'i'
'V' = 'W' 
'Ö' > 'Z' 
'+' > '-'

Whether these comparisons yield true or false depends on the collation. And lest you think that some of these are trivial and will always have the same outcome, at least in a case-insensitive collation, be warned: In this article you will see examples of when all of these four will yield true or false depending on the collation. You see, different languages do it differently.
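If you want to convince yourself of this right away, here is a small sketch that runs two of the comparisons under two different collations. (I'm using Finnish_Swedish and Latin1_General here purely as examples; we dissect collation names in detail later in the article.)

SELECT CASE WHEN 'V' = 'W' COLLATE Finnish_Swedish_CI_AS THEN 'true' ELSE 'false' END AS vw_finnish_swedish,
       CASE WHEN 'V' = 'W' COLLATE Latin1_General_CI_AS  THEN 'true' ELSE 'false' END AS vw_latin1_general,
       CASE WHEN 'Ö' > 'Z' COLLATE Finnish_Swedish_CI_AS THEN 'true' ELSE 'false' END AS oz_finnish_swedish,
       CASE WHEN 'Ö' > 'Z' COLLATE Latin1_General_CI_AS  THEN 'true' ELSE 'false' END AS oz_latin1_general

With these particular collations, you should get true in the Finnish_Swedish columns and false in the Latin1_General columns.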

Once the rules for comparison have been established, this sets the rules for sorting and grouping. That is, if Ö is > Z, Önnestad will sort after Zagreb, but if Ö < Z, Önnestad will sort before Zagreb. Likewise, if V and W are considered to be different, this query:

SELECT word, COUNT(*)
FROM   (VALUES('Wine'), ('Vine')) AS V(word)
GROUP BY word

will return two rows, but if they are equal, the query will return a single row with a count of 2.

Before we move on, let's make an observation: because the collation controls how string data is ordered, this also means that the collation defines the physical order of indexes on string columns. And this leads to an important property of collations: they are cast in stone. You may find that the collation for your favourite language has a horrible error like sorting F before E. But Microsoft will not change the collation. That is just not going to happen. The reason for this is simple: if they did, they would have to rebuild all indexes based on that collation when you install the fix. And what if that is a multi-terabyte table? It would take a long time to apply the fix and in the end your transaction log may fill up. No, Microsoft will never change a collation. But they may introduce a new corrected collation, and deprecate the old, incorrect one.

Because collations affect comparison, it is no surprise that the collation affects the result you get from functions such as charindex and patindex. But the collation also affects the result from the functions lower and upper, because different languages do it differently. We will see examples of this later in the article.

Another thing that is affected by collations is the range in a LIKE expression. This sometimes takes people by surprise. They write

WHERE col LIKE '[A-Z]%'

and expect to find words that start in uppercase, because they think of A-Z as an ASCII range, since this is how it works on Unix or in a text editor. But these ranges are also affected by the collation. Later in this article, we will look more at what happens with LIKE ranges depending on the collation, including the solution for how to find words that start in uppercase.
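Here is a small sketch of the surprise (the collation is just an example):

-- Under a typical case-insensitive collation, [A-Z] is not an ASCII range,
-- so a word starting with a lowercase letter still matches.
SELECT CASE WHEN 'apple' LIKE '[A-Z]%' COLLATE Latin1_General_CI_AS
            THEN 'match' ELSE 'no match' END AS result

This returns match, which is probably not what the person who wrote the pattern had in mind.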

The collation determines the code page used for the char, varchar and text data types, and thereby the character repertoire available with these data types. We will explore this theme in the next chapter.

The final point that a collation determines may be a little more difficult to grasp: which characters actually are defined. Or put in other words, which characters SQL Server understands. This point is of a technical nature, and is not related to language as such. However, depending on which language you work with, this may be very important to you. Exactly what this means is something we will return to when we look at the version number in the collation name.

So far we have talked about data, that is, what you have in your columns and variables. But the collation also affects metadata. That is, the names of databases, logins, tables, variables or other objects you define in SQL Server. This means that in one database you can refer to MyTable as MyTable, MYTABLE or mytable and get away with it, but in another database only MyTable is accepted. This is certainly a confusing aspect of collations that is due to legacy. As we will see in the next section, this is different in Azure SQL Database.
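As a minimal sketch of what this means in practice (the table here is purely hypothetical):

CREATE TABLE MyTable (id int NOT NULL)
SELECT id FROM MyTable   -- Always works.
SELECT id FROM MYTABLE   -- Works only if the collation that applies to metadata in
                         -- this database is case-insensitive; otherwise you get
                         -- "Invalid object name 'MYTABLE'".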

Where to Define the Collation?

In SQL Server, we can define the collation on four different levels: Server, Database, Column and Expression.

The Server Collation

You set the server collation when you install SQL Server. The server collation defines the collation for the system databases and it is also the default collation for new databases. Furthermore, the server collation is the default collation for string columns in temp tables (but not in table variables). When it comes to metadata, the server collation applies to the names of server-level objects such as databases and logins, as well as to the names of variables.

The server collation is very important, and selecting the server collation casually can lead to problems further down the road. Changing the collation after the fact is often very painful. If you run into a situation where you need to do this, check out the accompanying article Changing the Server and Database Collation in SQL Server, where I give suggestions for how to tackle the problem, as well as some scripts you can use.

On Windows, the default for the server collation is taken from the system locale, but that doesn't necessarily mean that you will get the collation you want. Or even one that fits your local language. To make matters worse, the collation is tucked away on a secondary tab, so an unfortunate default is not splashed in your face, unless you go looking for it. Here is a screenshot of the critical page in the Setup wizard:

Screenshot collation setup

You should make it a habit to always look at what is in the Collation tab!

In a later chapter, we will look into in more detail how Setup selects the default collation.

To inspect the server collation for an instance you can run this query:

SELECT serverproperty('collation')

Or you can simply look at the column collation_name for the rows in sys.databases for master, tempdb, model and msdb.
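That is, something like this:

SELECT name, collation_name
FROM   sys.databases
WHERE  name IN ('master', 'model', 'msdb', 'tempdb')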

The Database Collation

When you create a database, by default, the database will get the same collation as the server collation. However, you can override this by adding a COLLATE clause:

CREATE DATABASE test COLLATE Thai_CS_AS

The database collation defines the collation for string literals and the values of string variables (as opposed to the names of the variables, which are controlled by the server collation). The database collation also serves as the default collation for columns in newly created user tables as well as string columns in table variables (as opposed to temp tables, where the default is the server collation).

When it comes to metadata, the database collation applies to the names of objects created within the database: tables, columns, users, indexes, certificates etc. There is one exception: names of Service Broker objects such as message types, contracts and services, for which the collation is always Latin1_General_BIN. (Because these names need to be interoperable over multiple servers and databases). Names of Service Broker queues follow the database collation, though.

You can change the database collation with the ALTER DATABASE statement, but this only changes the collation of string columns in the system tables (that is, the collations for names of objects). Columns in user tables are not affected, but you need to handle these separately. Beware that there are several conditions that can cause ALTER DATABASE COLLATE to fail. Overall, changing the database collation is often a painful and complicated operation. In the accompanying article Changing the Server and Database Collation in SQL Server, I try to help with this by providing scripts and ideas. But even with this assistance, it is still a major undertaking.
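For reference, the statement itself is just a COLLATE clause on ALTER DATABASE. Here is a sketch, using the demo database from above and an arbitrary target collation (and keep the caveats above in mind):

ALTER DATABASE test COLLATE Norwegian_100_CI_AS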

As noted above, the default for a newly created user database is taken from the server collation. If you restore a database on a server with a different collation, the database collation does not change. This can lead to collation conflicts, which we will look into later in this chapter.

To view the collation for a database, you can use this query:

SELECT databasepropertyex('MyDB', 'Collation')

Or you can look at the database in sys.databases.

Column-level Collation

When you create a table, you can use the COLLATE clause to override the collation for a specific column. Here is an example that is a little extravagant.

CREATE TABLE MultiCollation (
    id      int                                    NOT NULL,
    normal  nvarchar(50)                           NULL,
    Lao     nvarchar(50) COLLATE Lao_100_CI_AS     NULL,
    Arabic  nvarchar(50) COLLATE Arabic_CI_AS      NULL,
    Maltese nvarchar(50) COLLATE Maltese_100_CI_AS NULL,
    Mohawk  nvarchar(50) COLLATE Mohawk_100_CI_AS  NULL,
    Kazakh  nvarchar(50) COLLATE Kazakh_90_CI_AS   NULL,
    CONSTRAINT pk_MultiCollation PRIMARY KEY (id)
)

In this table, the column normal will have the same collation as the database collation, so if you create it in the database test above, the collation will be Thai_CS_AS. If you create it in tempdb, the collation for normal will be the same as your server collation.

To view the collations for the columns in a table, you can use sp_help. You can see the collation in the rightmost column of the second result set, as shown here (with some columns minimised to conserve space):

Output from "sp_help MultiCollaton"

Since the collation for the column normal depends on the database collation, it is not unlikely that you are seeing a different collation for this column than above.

You can also query sys.columns:

SELECT name, collation_name
FROM   sys.columns
WHERE  object_id = object_id('dbo.MultiCollation')

You can change the collation of a column with ALTER TABLE. As always with ALTER TABLE, you need to give the full definition, even if you are only changing one thing. Say that you want to change the collation for the Lao column to a case-sensitive collation, and you want to make the Arabic column twice as long:

ALTER TABLE MultiCollation 
      ALTER COLUMN Lao nvarchar(50) COLLATE Lao_100_CS_AS NULL
ALTER TABLE MultiCollation 
      ALTER COLUMN Arabic nvarchar(100) COLLATE Arabic_CI_AS NULL

If you leave out the collation for the Arabic column, the collation of the column will be changed to your database collation!

You may ask what the use case is for setting the collation at column level. Certainly, in many databases, all string columns adhere to the database collation. There can however be business rules that mandate a collation deviating from the database collation. I recall one system I worked with where customers generally used a case-insensitive collation in production. But the system had to receive and store a reference id from an external system that was case-sensitive, so for the columns that could hold these reference ids, we had to mandate a case-sensitive collation. In the upcoming section on collation conflicts, we will learn about a special situation with temp tables where setting the column collation in a specific manner may be considered best practice. In the last chapter of this article, I will discuss some radical ideas about collations that lead to a wider use of setting the collation at column level.

Expression-level Collation

Finally, the collation can be set on expression level by applying the COLLATE clause. As a quick example, say that you want to see your data sorted according to the rules of Norwegian. Then you could say:

SELECT ...
FROM   MyTable
ORDER  BY stringcol COLLATE Norwegian_100_CI_AS

We will make frequent use of this clause in this article when we explore the result of different collations. Already in the next section, we will learn that we may need to use this clause to resolve collation conflicts.

An important thing to understand is that when you cast the collation of a column, you will render any index on the column useless. Say that MyTable in the example above has a clustered index on stringcol and that the collation for stringcol is French_CI_AS. French and Norwegian sort letters differently, and the index is organised for French. This means that if you want to see data sorted according to Norwegian rules, the index serves no purpose, but SQL Server needs to add a Sort operator to the plan.
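If you regularly need a second sort order for a column, one possible way out, sketched here under the same assumptions as the example above, is to add a computed column that casts the collation and index that column:

-- A computed column that presents stringcol under the Norwegian collation,
-- plus an index on that column. Queries that sort or filter on the computed
-- column may then be able to use the index instead of sorting.
ALTER TABLE MyTable ADD stringcol_nb AS stringcol COLLATE Norwegian_100_CI_AS
CREATE INDEX ix_MyTable_stringcol_nb ON MyTable (stringcol_nb)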

Azure SQL Database

The discussion so far in this section has been for on-prem SQL Server. For the cloud offerings there are some differences.

For Azure SQL Managed Instance the above holds true, except that there is no Setup wizard, obviously. The default server collation will always be SQL_Latin1_General_CP1_CI_AS. (Which is not the best collation, in my opinion. More about this later in this article.)

For Azure SQL Database there are some more differences due to the nature of the services. Here you have no choice of the server collation; it will always be SQL_Latin1_General_CP1_CI_AS. But since the server is only a logical container, this only affects the master database. This collation is also the default for new databases, but you can override this when you create the database. (You can create databases in a plethora of ways. I have not worked with all, but I take for granted that all permit you to set the collation.)

A very important difference in Azure SQL Database compared to on-prem SQL Server is that string columns in temp tables default to the database collation, not the server collation. This eradicates all collation conflicts related to temp tables.

Another important difference is that, by default, in Azure SQL Database the database collation only applies to user data, but not to metadata. For metadata, the collation will always be SQL_Latin1_General_CP1_CI_AS, so that the names of tables, columns, stored procedures etc follow this collation. This is a good thing, since it gives a consistent set of rules for identifiers across databases with different collations. It is possible to override this behaviour, as I will discuss in a later chapter, but it is nothing I recommend.

When it comes to column-level and expression-level collation, there is no difference between on-prem SQL Server and the cloud offerings.

Collation Conflicts

A common pain point with collations is collation conflicts. They are often the result of not paying attention to collations when you installed SQL Server or when you created the database.

Before we look into why this easily happens by accident, let's look at a case where we cause a conflict on purpose. Say that we want to find out if there are any rows where the Maltese and Mohawk columns in our MultiCollation table have the same value:

SELECT *
FROM   MultiCollation
WHERE  Maltese = Mohawk

This results in this error message:

Msg 468, Level 16, State 9, Line 22

Cannot resolve the collation conflict between "Mohawk_100_CI_AS" and "Maltese_100_CI_AS" in the equal to operation.

One of the things that collations control is how to compare data. Here SQL Server finds itself confused. Should it compare the two columns by the rules of Maltese, by the rules of Mohawk or by something else? SQL Server doesn't feel like tossing a coin or having an arbitrary rule like always going by the column on the left side. So instead, it raises an error, and the gist of the message is that we need to give specific instructions to SQL Server on which collation to use. If we want the rules of Mohawk to apply, we can say:

SELECT *
FROM   MultiCollation
WHERE  Maltese COLLATE Mohawk_100_CI_AS = Mohawk

These collation conflicts do not only appear with comparisons in JOIN and WHERE clauses, but any time you mix collations. Here is another example:

SELECT concat('The word for "', Maltese, '" in Mohawk is "', Mohawk, '".')
FROM   MultiCollation

This results in a very similar message to the one above:

Msg 451, Level 16, State 1, Line 25

Cannot resolve collation conflict between "Mohawk_100_CI_AS" and "Maltese_100_CI_AS" in concat operator occurring in SELECT statement column 1.

At first, this seems kind of pointless, since there are no collation rules in play here. But when SQL Server produces a result set, the result set includes metadata about types etc. That metadata includes the collation, and again SQL Server does not want to make a choice of its own.

Here is one more example:

DECLARE @lang nvarchar(20)
SELECT CASE @lang WHEN 'Lao'     THEN Lao
                  WHEN 'Arabic'  THEN Arabic
                  WHEN 'Maltese' THEN Maltese
                  WHEN 'Mohawk'  THEN Mohawk
                  WHEN 'Kazakh'  THEN Kazakh
                  ELSE normal
       END
FROM   MultiCollation

That is, we have the idea that depending on the language we will return the appropriate column from MultiCollation. But this fails with a collation conflict. A CASE expression has a static data type, and the collation is part of the data type, so for this to work, we need to cast the columns to one and the same collation.
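One way to resolve it, as a sketch, is to cast the branches to one and the same collation, for instance the database collation:

DECLARE @lang nvarchar(20)
SELECT CASE @lang WHEN 'Lao'     THEN Lao     COLLATE DATABASE_DEFAULT
                  WHEN 'Arabic'  THEN Arabic  COLLATE DATABASE_DEFAULT
                  WHEN 'Maltese' THEN Maltese COLLATE DATABASE_DEFAULT
                  WHEN 'Mohawk'  THEN Mohawk  COLLATE DATABASE_DEFAULT
                  WHEN 'Kazakh'  THEN Kazakh  COLLATE DATABASE_DEFAULT
                  ELSE normal
       END
FROM   MultiCollation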

After this preamble, let's look at a common case where collation conflicts catch people off guard. Say that you create a database on a server with server collation A and you create a procedure in that database where you create a temp table, and later in that procedure you join that temp table to a permanent table in the database. Everything works fine. But later, you restore a backup of that database on a server with collation B, and this is where your troubles start.

To make the demo shorter, let's say that you have already restored the backup on your instance. This is the database:

CREATE DATABASE SQL_1xCompat_CP850_CI_AS COLLATE SQL_1xCompat_CP850_CI_AS
go
USE SQL_1xCompat_CP850_CI_AS
go
CREATE TABLE Pizzas (id    int          NOT NULL,
                     name  nvarchar(40) NOT NULL,
                     CONSTRAINT pk_Pizzas PRIMARY KEY (id)
)
INSERT Pizzas(id, name)
   VALUES(1, 'Margherita'),
         (2, 'Capricciosa'),
         (3, 'Calzone'),
         (4, 'Quattro stagione'),
         (5, 'Marinara')
go
CREATE PROCEDURE pizza_sp AS
  CREATE TABLE #temp(name nvarchar(40) NOT NULL)
  -- In a real-world procedure we may fill #temp with values from elsewhere, but 
  -- here we just hard-code some values. 
  INSERT #temp(name)
     VALUES ('Margherita'),
            ('CALZONE')

  SELECT id, name
  FROM   Pizzas
  WHERE  name IN (SELECT name FROM #temp)
go

For the purpose of this demo, I've chosen a collation which I assume to be very rarely used, in order to create a situation where the server and database collation are different. Should your SQL Server instance actually have this collation as its server collation, you need to select a different database collation for the demo to work as intended.

Now run the stored procedure:

EXEC pizza_sp

And again, we have error message 468. (If you run this on your machine, the first collation in the error message is likely to be different, as it will be your server collation.)

Msg 468, Level 16, State 9, Procedure pizza_sp, Line 7 [Batch Start Line 60]

Cannot resolve the collation conflict between "Finnish_Swedish_100_CS_AS_SC_UTF8" and "SQL_1xCompat_CP850_CI_AS" in the equal to operation.

The reason this happens is that the default collation for columns in temp tables is the server collation.

From what we learnt above, we need to cast the collation on one side. It is quite logical that we want the temp table to follow the collation of the column in the permanent table. Thus, we can change the procedure to read:

ALTER PROCEDURE pizza_sp AS
  CREATE TABLE #temp(name nvarchar(40) NOT NULL)
  INSERT #temp(name)
     VALUES ('Margherita'),
            ('CALZONE')

  SELECT id, name
  FROM   Pizzas
  WHERE  name IN (SELECT name COLLATE SQL_1xCompat_CP850_CI_AS FROM #temp)
go
EXEC pizza_sp

This runs successfully. However, you may feel uncomfortable with hardcoding the collation, because tomorrow this application could be deployed at a different customer with a different requirement for the collation. Thankfully, there is a solution for this. Rather than a collation name, COLLATE can be followed by DATABASE_DEFAULT, which means that you want to cast the column to the database collation, whichever it may be. Thus:

ALTER PROCEDURE pizza_sp AS
  CREATE TABLE #temp(name nvarchar(40) NOT NULL)
  INSERT #temp(name)
     VALUES ('Margherita'),
            ('CALZONE')

  SELECT id, name
  FROM   Pizzas
  WHERE  name IN (SELECT name COLLATE DATABASE_DEFAULT FROM #temp)
go
EXEC pizza_sp

Again, this runs successfully. However, there is still room for improvement. Rather than casting the collation in every expression, it is better to set the collation when you create the temp table:

ALTER PROCEDURE pizza_sp AS
  CREATE TABLE #temp(name nvarchar(40) COLLATE DATABASE_DEFAULT NOT NULL)
  INSERT #temp(name)
     VALUES ('Margherita'),
            ('CALZONE')

  SELECT id, name
  FROM   Pizzas
  WHERE  name IN (SELECT name FROM #temp)
go
EXEC pizza_sp

Generally, best practice is to use COLLATE DATABASE_DEFAULT for string columns in temp tables, to avoid the pain when you copy databases between different instances.

It follows from the previous section that this problem does not apply to Azure SQL Database, where string columns in temp tables take their default from the database collation.

Before we leave this topic, let's study more cases, so that we get a full understanding of what happens when you mix collations. What outcome do you expect here:

ALTER PROCEDURE pizza_sp AS
  CREATE TABLE #temp(name nvarchar(40) NOT NULL)
  INSERT #temp(name)
     VALUES ('Margherita'),
            ('CALZONE')

  SELECT id, name
  FROM   Pizzas
  WHERE  name IN (SELECT name COLLATE Maltese_100_CS_AS FROM #temp)

That is, we cast the collation of the column in the temp table to a third collation, which is different from both the database collation and the server collation. (For reasons that will become apparent later, I know that your server collation is not Maltese_100_CS_AS.) We have learnt that when two collations meet, SQL Server finds it difficult to make a choice and raises an error. So it seems logical that this would happen here, since we have SQL_1xCompat_CP850_CI_AS on the left side and Maltese_100_CS_AS on the right side. But, no, the procedure runs successfully. However, there is one difference in the output: In the previous executions, we got two rows back, Margherita and Calzone. But this time we only got back Margherita. This is the scoop: when you cast the collation of one column in an expression, this collation cast applies to all columns in the expression. And since Maltese_100_CS_AS is a case-sensitive collation, we did not get a hit on Calzone.

So far we have looked at examples with two columns. Let's now look at what happens if we compare a column to a literal value or a variable:

CREATE TABLE #temp(name nvarchar(40) COLLATE French_CS_AS NOT NULL)
INSERT #temp(name)
   VALUES ('Margherita'),
          ('CALZONE')
DECLARE @p nvarchar(20) = 'calzone'
SELECT name FROM #temp WHERE name = @p
SELECT name FROM #temp WHERE name = 'calzone'

Recall that variables and literals follow the database collation, so there are two different collations in play here as well. In contrast to when we compared two columns, this runs without error. However, no rows are returned. This is because when a column meets a variable or a literal, the collation of the column takes precedence. And in this example, the column has a case-sensitive collation, so the lowercase calzone we are searching for does not match the CALZONE in the table.

It is also worth mentioning that you can cast the collation of a parameter or a literal as well. For instance:

SELECT name FROM #temp WHERE name = 'calzone' COLLATE DATABASE_DEFAULT

This returns the row with CALZONE. Above, I said that when you cast the collation for one column, it applies to all columns. More generally, if you cast the collation of one sub-expression in a larger expression, this collation cast applies to all sub-expressions that make up the bigger expression. (What if you cast the collation of more than one sub-expression? As long as you cast them to the same collation, that is fine, but if you cast them to different collations, you get a collation conflict.)
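Here is a small sketch of that last point, still using the temp table from above:

-- Two explicit casts to the same collation: no conflict (but also no row back,
-- since French_CS_AS is case-sensitive).
SELECT name FROM #temp
WHERE  name COLLATE French_CS_AS = 'calzone' COLLATE French_CS_AS

-- Two explicit casts to different collations: error 468 again.
SELECT name FROM #temp
WHERE  name COLLATE French_CS_AS = 'calzone' COLLATE Latin1_General_CI_AS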

If you want to read more about collation conflicts and how SQL Server resolves them, see the topic Collation Precedence in the SQL Docs. That text takes a much more formal approach than I have done here.

A History of Character Sets, Code Pages and Collations

In the Beginning There Was ASCII...

As you know, everything in a computer is bits that are either 0 or 1. From these bits we can build bytes, words, quadwords etc that can represent integer numbers of the desired length. And with some conventions, these bit patterns can also represent floating-point numbers, decimals, date and time etc. But how to represent character data? The long-standing solution in computing has been to map each character to a number. These numbers are often referred to as code points, a term I will use throughout this article.

The title for this section suggests that it all began with ASCII. That is not true; there were character encodings before ASCII. However, we don't need to concern ourselves with any of these, because for all practical matters ASCII is the starting point for all character processing in modern computing. (Well, maybe there are still some mainframes that use EBCDIC, but they are of no interest in the context of SQL Server.)

ASCII is a 7‑bit character set of 95 printable characters, assigning a character to each number from 32 to 126. The remaining code points in the range 0-127 are control characters, of which some are very important in computing, while others are obsolete today. Looking at the ranges in hex, we find this division:

0x20-0x2F
This range starts with Space followed by a number of punctuation characters (punctuation in the broad sense, including mathematical characters).
0x30-0x3F
This range starts with the ten digits 0 to 9, followed by more punctuation.
0x40-0x5F
Mainly uppercase letters, but the range starts with one punctuation character, the famous cinnamon bun (as we call it in Swedish) @, and there are some more punctuation characters at the end, including the brackets, [].
0x60-0x7E
Mainly lowercase letters, but again one punctuation character at the start and a few more at the end, amongst others, the braces, {}.

The A in ASCII stands for American, so not surprisingly it is a character set designed for English. This means that it was insufficient for speakers of other languages that use letters not in the English alphabet. The solution to this was to replace some or all of the punctuation characters in the range 0x40-0x7E with these extra letters to create national variations. ASCII was standardised by ISO as ISO-646, and these national variations had the country code at the end. For example, the Swedish variation was known as ISO-646-SE. These variations are dead now. Or at least they should be. This is a photo I took in the summer of 2021 of the display of a cash machine in a café on the street where I live.

Image of cash machine saying "sm|rg}s" when it should say "smörgås", Swedish for "sandwich"

Yes, it says Sm|rg}s. And for many young Swedes this is just some piece of garbage, even if they are into IT. And, well, it is garbage. But I've been around with computers for some time, and I have been through the period where it was commonplace to see ISO-646-SE being represented in plain ASCII. So I am perfectly used to reading this as Smörgås. (Which means sandwich. We will talk more about Swedish sandwiches later in this text, more precisely shrimp sandwiches, for reasons that will become apparent.)

Thankfully, that cash machine is an exception. The occasions I have run into Swedish in 7‑bit ASCII in the third millennium are few and far between. And the café has a new cash machine these days.

The Competing 8-bit Character Sets

As you can imagine, the solution with national variations of 7‑bit ASCII was highly impractical. Obviously, I wanted the computer to display Swedish text correctly to me, but this also meant that when I worked in Pascal (which was my main language at the time), I had to look at things like:

MyArrayÄiÅ := 98;

rather than

MyArray[i] := 98;

You get used to such things, but pretty it is not. And what if I wanted to mix Swedish and, say, French and Polish in the same text? Just forget it.

As for why ASCII only used seven bits, there were a number of reasons that are not of interest here, but see the Wikipedia article about ASCII if you want to know. In the early 1980s, vendors started to realise that there was no valid reason not to use all eight bits of a byte. Using the full byte would remove the need to use replacement characters, and permit brackets and braces to remain brackets and braces. And it would be possible to mix languages in a text, at least to some extent.

Unfortunately, just about every vendor composed their own character set. They all retained ASCII for the range 32-126, but for the range 128-255, everyone had their own idea. This made it precarious to move text from one platform to another. For instance, if a user on HP, using their ROMAN8 character set, typed räksmörgås (a Swedish word meaning shrimp sandwich, and which features all three characters of the Swedish alphabet that are not in the English alphabet), it would appear as rÌksmÎrgÔs on my VT220 using the DEC Multinational Character Set, DEC MCS.

Eventually, this mess was resolved by the introduction of the standard ISO 8859 (which was based on DEC MCS). ISO 8859 is a family of character sets, with each set targeting a certain group of languages. Probably the most commonly used is ISO‑8859‑1, known as Latin‑1 which supports most languages in the Western world as it was known at the time. (Recall that this was the 1980s, and the Iron Curtain still stood.) ISO‑8859‑2 or Latin‑2 is for languages in Eastern Europe. ISO‑8859‑3 and ‑4 – Latin‑3 and Latin‑4 – target languages with a smaller number of speakers. ISO‑8859‑5 to ISO‑8859‑8 are for the Cyrillic, Arabic, Greek and Hebrew scripts in that order. Noteworthy is also ISO‑8859‑9 or Latin‑5, which is very similar to Latin‑1, but which rips out letters used by Icelandic and Faroese in order to support Turkish instead. (Which originally was intended to be supported by Latin‑3.) For a complete list of ISO-8859 character sets, see Wikipedia.

Many characters appear in more than one of the ISO-8859 sets. Typically, a character has the same code point in all sets it appears in (but there seem to be some exceptions in the sets added after ISO‑8859‑9). So, for instance, if I type räksmörgås in Latin‑1, a Polish user with Latin‑2 will see räksmörgĺs. That is, ä and ö are both in Latin‑1 and Latin‑2, but å is only in Latin‑1, and Latin‑2 has a completely unrelated character at that code point. Only one character appears in all sets and that is no-break space at code point 160 (0xA0). The soft hyphen (0xAD) is in almost all sets, with ISO‑8859‑11 for Thai as the sole exception.

Like ASCII, Latin‑1 has the characters nicely divided into ranges:

0x80-0x9F (128-159)
Reserved for control characters and they are not assigned to any printable character.
0xA0-0xBF
Punctuation characters.
0xC0-0xDF
Uppercase letters (and one punctuation character).
0xE0-0xFF
Lowercase letters (and one punctuation character).

The other sets are not equally neatly arranged. All of them leave the range 0x80-0x9F undefined, but some sets have letters also in the range 0xA0-0xBF – for the simple reason that they need to support more letters than fit into the upper ranges. The Cyrillic set consists almost only of letters. On the other hand, the Hebrew and the Arabic sets leave many code points undefined.

Enter Unicode

The introduction of the 8‑bit character sets certainly was an improvement. I could now write code and use brackets and braces correctly and still write my comments in Swedish if I wanted to. Furthermore, if I wanted to mix Swedish and French in a text, that was now possible.

But it was certainly not perfect. For instance, if I wanted to mix Swedish and Polish in a text, that was still kind of hard. And if you look at the scripts and languages mentioned in the previous section you may note there are some major languages missing: Chinese, Japanese, all the languages of India etc. Not that there weren't character sets to support these languages, but they were not covered by ISO.

Therefore, computer vendors again sat down and said We need something better. We need a character set that supports all languages. Thankfully, this time they decided to work together and formed the Unicode Consortium. There was still competition, as ISO worked on their own initiative, ISO 10646. But Unicode had the momentum, although I believe that it picked up some influences from the original ISO 10646 proposal. While Unicode is mainly known by that name, Unicode is also standardised by ISO as, yes, ISO 10646.

Unicode is a 21‑bit character set. Or maybe it's something like a 20.5-bit character set. That is, the highest code point that can be used is 0x10FFFF and not 0x1FFFFF. In total there are a little more than 1.1 million possible code points. As of this writing, around 155 000 code points are defined. New versions of Unicode keep coming out, each adding more characters. (You may find this limit of 0x10FFFF odd, but there is a technical reason for this limit, which we will come back to when we look at surrogate-compatible collations.)

While this may seem a frightening lot, it is important to know that the so-called base plane, 0-65535 (0x0000-0xFFFF) covers all living languages. Beyond the base plane, there are 16 supplementary planes, each consisting of 65536 code points. Of these, the first is known as the Supplementary Multilingual Plane and the second is the Supplementary Ideographic Plane. Remaining planes are fairly sparsely used or not used at all. Within the planes, characters are divided into blocks, containing a set of related characters. For examples, see the article Plane (Unicode) in Wikipedia. Trying to learn all the ranges in Unicode is about impossible, but there are two you should know. The first is the range 32-126 which is, yes, ASCII. And the range 160-255 is taken directly from ISO-8859‑1 or Latin‑1.

Characters in Unicode are often referred to as U+XXXX, where XXXX is the hex value for the code point. Below are a few samples (to the left is text and to the right there is an image, in case some characters do not display for you).

  • U+0041 = A. An ASCII character.
  • U+00F6 = ö. A lowercase character from Latin‑1, used in Swedish and other languages.
  • U+0141 = Ł. A character from Latin‑2, used in Polish.
  • U+03C0 = π. Lowercase Greek character "pi".
  • U+5143 = 元. The Chinese character "yuan".
  • U+1F733 = 🜳. Alchemical symbol for Regulus-2.
Image of the list to the left.

For characters in the base plane, the notation always uses exactly four hex digits. Only for characters in the supplementary planes are five (or six) hex digits used. (Finding an example which requires a sixth hex digit is not simple, since the plane for U+100000 to U+10FFFF is a so-called Private Use Area plane, which means that the Unicode Consortium themselves will never define any characters in this plane.)
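In T‑SQL you can move between characters and code points with the unicode and nchar functions, which can be handy when you want to check exactly which character you have in front of you:

SELECT unicode(N'Ł') AS codepoint,   -- 321, that is, U+0141.
       nchar(960)    AS pi_char      -- π, that is, U+03C0 (960 in decimal).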

Unicode as such only defines code points, but how do we store them in a computer? 21‑bit entities are not really practical. No, to store Unicode characters, you need to apply an encoding, and there are a couple of them:

UTF‑32
Each character is stored in a 32-bit double-word, which makes for a fixed-length encoding. This encoding is not very widely used, but there seem to be a few programming languages that use UTF‑32 internally.
UTF‑16
Characters in the base plane are encoded with 16‑bit numbers, whereas characters in the supplementary planes are encoded with four bytes in so-called surrogate pairs, which we will look into later when we discuss SC collations. Windows uses UTF‑16 internally, and the data types nchar, nvarchar and ntext are stored as UTF‑16 in SQL Server.
UTF‑8
A variable-length encoding where a code point is encoded with one, two, three or four bytes. The encoding is constructed in such a way that a file with only ASCII characters is also a valid UTF‑8 file. This is a very popular encoding in the Western world, and there are many operating systems and programming languages that use this encoding. We will look more at UTF‑8 when we discuss UTF‑8 collations.
GB 18030
This encoding is the official character set of the People's Republic of China. I will have to admit that I am not familiar with how it is constructed, but supposedly it is more efficient to use with Chinese than UTF‑8 (where each Chinese character requires three bytes).

There are a few more UTF-nn out there, but they are not widely used and/or are obsolete.
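We will return to UTF‑8 collations later in the article, but as a small sketch of the variable-length nature of UTF‑8 (this requires SQL Server 2019 or later, where UTF‑8 collations were introduced):

-- With a UTF-8 collation, varchar holds Unicode data, and the number of bytes per
-- character varies: one for an ASCII character, two for ö, three for a Kanji character.
CREATE TABLE #utf8demo (word varchar(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL)
INSERT #utf8demo (word) VALUES (N'a'), (N'ö'), (N'中')
SELECT word, datalength(word) AS bytelen FROM #utf8demo   -- Returns 1, 2 and 3 bytes respectively.
DROP TABLE #utf8demo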

There is a complication with UTF‑16 (and UTF‑32): Depending on the architecture, the bytes in a number can come in different order. In a big-endian architecture, U+0041, that is A, is stored as exactly that: 0x0041. But in a little-endian architecture, the bytes are swapped, so you get 0x4100 instead. In practice, this means that there are two UTF‑16 encodings, UTF‑16BE and UTF‑16LE. How would a text editor know which encoding is used? Heuristics can be used to determine the order, and this can work well for a text in a language that uses the Latin script, as the many ASCII characters give a hint of what is going on. I would expect this to work less well, if at all, for a text in Georgian (a language that uses its own script).
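You can in fact see the little-endian byte order directly in SQL Server, since nvarchar data is stored as UTF‑16LE:

SELECT convert(varbinary(2), N'A')   -- Returns 0x4100, not 0x0041.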

To resolve this more robustly, UTF‑16 files often start with a byte-order mark, BOM. This is character U+FEFF, ZERO WIDTH NO-BREAK SPACE (which is a character that only makes sense when it appears between two other characters, so it would never appear first in a text). With UTF‑16LE it appears as the byte sequence FF FE. There is no U+FFFE, or more precisely, Unicode defines this code point as a "non-character". For UTF‑8 there is no endianness issue, because UTF‑8 always encodes the characters as a sequence of bytes. But since a UTF‑8 file could be taken for a file encoded in a legacy character set such as ISO-8859‑2, it is not uncommon to add a BOM to UTF‑8 files as well. If you look at a file in a hex editor and you see that the first three bytes are EF BB BF, you are looking at a UTF‑8 file with a BOM. I learn from Wikipedia that in GB 18030 the BOM appears as the sequence 84 31 95 33. How common it is to use a BOM with GB 18030, I don't know.

I have been using the word räksmörgås as an example a few times (and you will see the word several times more in this text). If someone writes räksmörgås in a file with UTF‑8 encoding, and I open it in a text editor that assumes it is a file in Latin‑1, it will display as rÃ¤ksmÃ¶rgÃ¥s. While it is about impossible to learn to recognise the exact mapping, these sequences of two non-ASCII garbage characters are a sure sign of UTF‑8 being interpreted as a legacy character set.

On the other hand, if I open a file in Latin‑1 as if it was a UTF‑8 file, this is how the shrimp sandwich may appear (again with text to the left and image to the right):

r�ksm�rg�s Image of text where Latin-1 is incorrectly taken to be UTF-8

The rhombus with a question mark is the Unicode character U+FFFD, REPLACEMENT CHARACTER. Many environments display this character when they identify an encoding error. In UTF‑8, a byte in the range 0x80-0xFF cannot appear on its own, but bytes with the high bit set must always come in sequences of two, three or four. I should add that the exact display of this character can vary, and some environments may also opt to display no character at all (resulting in rksmrgs or something even weirder).

When working with Unicode text, you may also see one or more open rectangles as this example:

Text where some characters are unavailable in the font and replaced with a generic glyph

This happens when the font or your program does not have a glyph for the character in question and instead displays a fallback glyph. Depending on your font and your browser, you may see more than one open rectangle when reading this article and that is why I also include the intended text in images, and where applicable, also as scripts.

Microsoft and Their Code Pages and SQL Server

You may have noticed that in the section about the 8‑bit character sets that I did not mention Microsoft. But obviously, since this is a text about a Microsoft product, there is all reason to look at what Microsoft did in this area.

In the 1980s, Microsoft, as you may recall, worked closely with IBM and their PC. And from IBM, Microsoft inherited the term code page for a character set. The original IBM PC used code page 437, CP437, which uses more or less the full 8‑bit range from 0 to 255 for printable characters. The code page includes a selection of letters used in West European languages, but there is also an abundance of box-drawing characters.

Later, Microsoft introduced CP850 with a more complete support for the languages of Western Europe, as well as code pages such as CP852 to support the languages of Eastern Europe and CP855 for Cyrillic to name a few. These code pages have roughly the same character repertoires as the corresponding ISO-8859 sets, but just like the other competing character sets of the eighties, the characters are in different places. So räksmörgås entered in Latin‑1, would be displayed as rõksm÷rgÕs on MS-DOS. If the shrimp sandwich travelled in the other direction, it was even worse, because in CP850 the letters åäö are all in the range 0x80-0x9F which are control characters in Latin‑1.

When Microsoft moved on to Windows, they embraced ISO-8859, but only to some extent. For Western Europe, there is CP1252, which agrees with ISO-8859‑1, but in an attack of horror vacui, they added printable characters in the range 0x80-0x9F. For Eastern Europe, there is CP1250 which has some rearrangements compared to Latin‑2. CP1251 for Cyrillic does not resemble ISO-8859‑5 much at all. Microsoft also has code pages for Japanese (CP932) and Chinese (CP936) as well as code pages for a few more languages. There is also code page 65001 for UTF‑8.

Now, let's bring this down to SQL Server. When you use the nvarchar data type, you have access to the full range of the Unicode character set. On the other hand, when you use varchar, you only have access to the characters in the code page that is tied to the collation in question. Let's look at this in some demos. If you have any problems with any of the non-ASCII characters in the demos, you also find these scripts in the file codepages.sql in the zip file collationscripts.zip.

Important note: For brevity, I said nvarchar and varchar in the paragraph above, but it is tacitly understood that what applies to nvarchar also applies to the nchar and ntext data types, and analogously what applies to varchar also applies to the char and text data types. This contrast between nchar/nvarchar/ntext on the one hand and char/varchar/text on the other is something I will return to many times in this article, and in most places I will only say nvarchar or varchar and make the same tacit understanding in the interest of conciseness.

The first step is to create a database with a specific collation and find what is the code page for that collation.

CREATE DATABASE Latin1_General_CI_AS COLLATE Latin1_General_CI_AS
go
USE Latin1_General_CI_AS
go
SELECT collationproperty(convert(nvarchar(128),
                            databasepropertyex(db_name(), 'Collation')),
                        'CodePage')

Note: The reader may find this query overly convoluted. The convert to nvarchar is required, because the function databasepropertyex returns sql_variant, and there are no implicit conversions from that type to other types.

The query returns the value 1252, which, as I mentioned earlier, is the code page for ISO Latin‑1.

Next, we will run a SELECT query that returns four words. The first word is the Swedish word for shrimp sandwich which you have already seen a couple of times. The second is the Polish word for sun. You may recognise the third word as the first four letters of the Greek alphabet. The last word is Nakameguro, a district in Tokyo, written in Kanji.

SELECT N'räksmörgås' AS shrimps,
       N'słońce'     AS sun, 
       N'αβγδ'       AS Greek,
       N'中目黒'     AS Nakameguro

The result set looks like this:

Display of the four words as nvarchar

That is, all four words are returned without changes. This may not seem very startling, but it is important to understand why it happens. Because all string literals start with N, the data type of the literals is nvarchar, which means that we have access to all characters in Unicode. (Well, more precisely, in the Unicode base plane with this collation. More about that later.)

If we remove the N, the data type will be varchar, and look now what happens.

SELECT 'räksmörgås' AS shrimps,
       'słońce'     AS sun, 
       'αβγδ'       AS Greek,
       '中目黒'     AS Nakameguro

This is the output:

Display of the four words as varchar

As you can see, things have changed. With varchar, we only have access to the characters in the code page of the collation, and as we saw above, this is code page 1252. The Swedish räksmörgås is unchanged. Since CP1252 is for Western European languages, it includes all characters needed to write Swedish. But the Polish słońce has been changed to slonce. The two Polish characters ł and ń are not available in CP1252, so they are replaced by fallback characters. Often a fallback character is selected to be a lookalike, similar to the original character, and this pattern is used for the Polish letters. (You could say that the accents are removed, and while that certainly is true from an English perspective, you should know that ł and ń are letters in their own right in the Polish alphabet.) The Japanese name shows another pattern. In this case, there is nothing resembling the Kanji characters in code page 1252, so they are replaced with the question mark, which is the generic fallback character, used when nothing else fits.

I included the Greek letters, because they demonstrate that the choice of fallback characters can be quite whimsical. Many readers may expect the result to be abgd, but as we can see this does not happen for beta and gamma. Beta is replaced by the German "scharfes s". Certainly a lookalike, but there is no logical connection between the two. And for gamma, we get the generic fallback character. So why is this? Well, there is a good reason for not replacing beta with b and gamma with g. To wit, in modern Greek, beta is pronounced as v and gamma as a voiced version of ch in Scottish Loch. But there is no consistency here, since the modern pronunciation of delta is like English th in this.

Before we move on, I'd like to make a quick recap of the various fallbacks we have seen: lookalike characters, where the fallback resembles the original character (slonce for słońce, and ß for β), and the generic fallback character, the question mark, which is used when nothing similar is available (as for γ and the Kanji characters).

Let's now create a database with a Japanese collation and look at its code page:

DROP DATABASE IF EXISTS Japanese_CI_AS
CREATE DATABASE Japanese_CI_AS COLLATE Japanese_CI_AS
go
USE Japanese_CI_AS
go
SELECT collationproperty(convert(nvarchar(128),
                            databasepropertyex(db_name(), 'Collation')),
                        'CodePage')

The return value is 932, the code page for Japanese. Let's now run the two SELECT statements again.

SELECT N'räksmörgås' AS shrimps,
       N'słońce'     AS sun, 
       N'αβγδ'       AS Greek,
       N'中目黒'     AS Nakameguro

SELECT 'räksmörgås' AS shrimps,
       'słońce'     AS sun, 
       'αβγδ'       AS Greek,
       '中目黒'     AS Nakameguro
Two SELECT statements with Swedish, Polish, Greek and Japanese

This is the output:

The output from the two queries above

The SELECT with nvarchar does not change, but the SELECT with varchar does. The Swedish word is changed into raksmorgas. That is, the Swedish letters are replaced by lookalikes. The Polish word is also altered, but for some reason the generic fallback character is used rather than the lookalikes that we had above. (Please don't ask why!) On the other hand, the Greek and Japanese words are returned unchanged. The fact that the Japanese name is presented correctly is quite logical: it's after all a Japanese collation. The Greek characters are more surprising, but apparently CP932 includes some Greek characters as well. Presumably, because Greek letters are often used in mathematics and physics.

Now, think of this a little while. There are 256 code points available, with the lower half being reserved for ASCII and control characters. As you may know, Kanji is an ideographic writing system where each character basically represents a word, and there are thousands of Kanji letters. How can they squeeze all of these into just 128 code points and add some Greek on top of it? Here is a script to explain what is going on.

CREATE TABLE words (word varchar(8) NOT NULL)
INSERT words(word)
   VALUES('中目黒'), ('Yokohama'), ('αβγδ')
SELECT word, len(word) AS charlen, 
       datalength(word) AS bytelen
FROM   words
Image of the code to the left.

To understand this query, you should know that the len function returns the number of characters (excluding trailing spaces), whereas the datalength function returns the size in bytes. This is the output from the query.

Result set from the query above.

We can see that Yokohama, written in Latin letters, is eight characters and eight bytes. Whereas Nakameguro in Kanji is three characters, but six bytes. Likewise, the Greek letters are four characters, but they take up eight bytes. So that is how they do it! In code page 932, the lower range 0-127 is one byte per character, just like ASCII. But in the higher range, they combine two bytes to compose a single character. You might have been thinking varchar = 8‑bit characters, but as you can see, this is not correct for all code pages.
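If you want to see the actual byte values, a quick extra check of mine – not part of the original demo – is to convert the column to varbinary, which exposes the raw bytes stored for each word:

SELECT word,
       convert(varbinary(16), word) AS rawbytes,   -- the raw CP932 bytes
       datalength(word)             AS bytelen
FROM   words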

Now that we have seen that characters are variable length in this code page, we may ask ourselves what varchar(8) actually means. Let's try it! We will try to add the name of a subway station in Tokyo, more precisely the Shinjuku-nishiguchi station, which is close to the Shinjuku railway station, the busiest train station in the world.

INSERT words(word)
   VALUES('新宿西口駅')
INSERT Shinjuku-nishiguchi station.

Alas, it ends in tears:

Error message "String or binary data would be truncated".

That is, varchar(8) means eight bytes, not eight characters. This may not always be practical, but this is what the Japanese people have had to put up with ever since support for Japanese was introduced in SQL 6.5.
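If you truly need a length of eight characters rather than eight bytes, nvarchar is the way out, since nvarchar lengths are counted in 16‑bit units. Here is a minimal sketch with a hypothetical table words_n that is not part of the demo above:

CREATE TABLE words_n (word nvarchar(8) NOT NULL)
-- The station name is five characters, so it fits in nvarchar(8).
INSERT words_n(word)
   VALUES(N'新宿西口駅')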

We will now move to a different part of Asia, to wit, the Indian subcontinent. Let's start with creating a database with an Indic collation:

CREATE DATABASE Indic_General_100_CI_AS COLLATE Indic_General_100_CI_AS

But, wait, there is an error message:

Msg 453, Level 16, State 1, Line 60

Collation 'Indic_General_100_CI_AS' is supported on Unicode data types only and cannot be set at the database or server level.

OK, but what if we try to create a table with this collation? It works when we use nvarchar:

CREATE TABLE Indic(a nvarchar(20) COLLATE Indic_General_100_CI_AS)

But when we try varchar instead:

CREATE TABLE Indic2(a varchar(20) COLLATE Indic_General_100_CI_AS)

We get a similar error message to the one above:

Msg 459, Level 16, State 2, Line 19

Collation 'Indic_General_100_CI_AS' is supported on Unicode data types only and cannot be applied to char, varchar or text data types.

Why is it so? Partly, the answer to that question is found in the output to this query:

SELECT collationproperty('Indic_General_100_CI_AS', 'Codepage')

The query returns 0. That is, there is no code page tied to this collation. The follow-up question is obviously, how come? I don't know the full story, but it seems that Microsoft does not have any code page for Indic languages.

The Indic_General_100 collations are not the only collations that are supported for nvarchar only. You can find the full list with this query:

SELECT * FROM sys.fn_helpcollations() WHERE collationproperty(name, 'CodePage') = 0

In total, this query returns 510 collations. If you want to see which groups of languages are affected, you can modify the query to only look at one type of collation per group:

SELECT * 
FROM   sys.fn_helpcollations() 
WHERE  collationproperty(name, 'CodePage') = 0
  AND  name LIKE '%CI_AS'

This returns 15 rows. For reference, I have added the country/area where the language is spoken where it may not be obvious from the name.

Besides not being permitted to be used for the database collation, they are equally disallowed for the server collation. (That is why I said earlier that I knew that your server collation is not Maltese_100_CI_AS.)

You may ask, what if people who are speaking these languages want to have a database with a collation for their language? That question was without an answer until the release of SQL 2019 which introduced the UTF‑8 collations, which we will return to later.
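On SQL 2019 or later, you can check for yourself whether a UTF‑8 variant exists for the Indic_General_100 family. This is just a look-ahead sketch; the UTF‑8 collations are covered properly later in the article:

SELECT name
FROM   sys.fn_helpcollations()
WHERE  name LIKE 'Indic[_]General[_]100%UTF8'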

In case you wonder about this 90/100 business – this is something we will return to when we look at version numbers.

The History of Collations in SQL Server

In the early days, SQL Server did not have collations, but instead there were sort orders. A sort order defined the same type of rules as a collation does, but it only applied to the char, varchar and text data types – there were no Unicode data types back then. Furthermore, you did not have the choice of setting a specific sort order for a column or a database. Instead, when you installed SQL Server you selected a sort order that applied to the entire server. Just like collations, a sort order was tied to a code page. This set of code pages included not only ANSI code pages like CP1250 and CP1252, but there were also sort orders based on the proprietary code pages such as CP437 and CP850.

Interestingly enough, it was possible – and supported! – to define your own sort order. This was actually something I took advantage of. In our shop we wanted a case-sensitive sort order for Swedish, but the only such sort order that Microsoft shipped was for CP850, which does not play well with a Windows-based application. Therefore, I defined a case-sensitive sort order for Swedish based on CP1252, Windows Latin‑1.

Up to SQL 6.0, all sort orders were tied to a single-byte code page for the Western world and varchar was truly 8‑bit. SQL 6.5 introduced sort orders for Chinese, Japanese and Korean that used multi-byte code pages. So that is how long ago varchar stopped being an 8‑bit character type.

In SQL 7, Microsoft introduced the nchar, nvarchar and ntext data types for Unicode support. They did not extend the existing sort orders to these data types. Instead, when you installed SQL Server, you had to select one sort order for varchar and one Unicode collation (as the term was) for nvarchar. As I reviewed the SQL 7 documentation for this article, I realised there was nothing that said they had to match. To amuse myself, I created a VM with some ancient operating system and dug out the installation media for SQL 7 to install an instance with a Japanese sort order for varchar and a Swedish collation for nvarchar. On top of all, I opted for a case-insensitive sort order for varchar and a case-sensitive collation for nvarchar. See this screen from old Query Analyzer for a truly confusing and unique experience:

Output of a query on SQL 7.

Still with SQL 7, once you had selected your combination of sort order and Unicode collation, it applied to the entire server. This made things quite painful if you had two applications that required different sort orders or collations. You may think yeah, just install a second instance or spin up a new VM. But SQL 7 did not support named instances (they were introduced in SQL 2000), so you needed a second computer. And that had to be a physical computer, since back in those days, virtual machines were not the commodity they are now.

With SQL 2000, Microsoft introduced the collation concept as we know it today. You had to select one collation when you installed SQL Server (as opposed to the funny combo sort order + Unicode collation in SQL 7). This collation no longer had to be universal for the instance, since SQL 2000 was also the version where Microsoft introduced the ability to set the collation on server, database, column and expression level independently that we looked at in the previous chapter.

With the introduction of collations, Microsoft divided them into two main groups: Windows collations and SQL collations. The Windows collations were the Unicode collations from SQL 7, now extended to also cover varchar. And extended is really the word here. In a Windows collation, all operations on varchar are carried out by converting the data internally to UTF‑16 and back. Thus, exactly the same rules apply to varchar and nvarchar. It is worth noting that this internal conversion adds some overhead, so with a Windows collation, string operations are generally somewhat slower with varchar, contrary to what you may expect.

As for the name "Windows collation", it's a name that may be a little confusing. Books Online says: Windows collations define rules for storing character data that's based on an associated Windows system locale. It took me many years to understand what they are trying to say here. It is not the case that SQL Server asks Windows every time it needs to perform an operation that includes a collation – that would not be efficient. Rather, the SQL Server team has piggybacked on work done by the Windows team and uses their libraries, possibly adapted for SQL Server. (I have no information on the exact technical details.) This puts the Windows collations in opposition to the SQL collations, which have their origins within SQL Server itself and whence the name. We will come back to the connection to Windows locales in the section, Collations Families and Windows Locales, at the end of the long chapter where we dissect the names of Windows collations.

You can easily tell whether a collation is a Windows collation or an SQL collation: the names of SQL collations start with SQL_, the names of Windows collations do not. The SQL collations were created by marrying an old sort order with a matching Unicode collation. This means that there is one library for varchar and another library for nvarchar. While the libraries are for the same language community, there are subtle differences in the rules, and in contrast to the Windows collations you can get different results for varchar data than you get for nvarchar data. Since the libraries for varchar only cover 255 characters (there are no SQL collations for multi-byte code pages), string operations on varchar are faster than operations on nvarchar. For most operations this is a moderate gain in performance, but there is one situation where the difference is drastic, as we will see in the performance chapter.
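As a quick illustration of the naming convention – nothing more than a small sketch of my own – you can count the two groups by looking at the SQL_ prefix:

SELECT SUM(IIF(name LIKE 'SQL[_]%', 1, 0)) AS SQL_collations,
       SUM(IIF(name LIKE 'SQL[_]%', 0, 1)) AS Windows_collations
FROM   sys.fn_helpcollations()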

Since SQL 2000, the number of collations has grown. This is the number of rows returned by sys.fn_helpcollations on each version of SQL Server since SQL 2000:

SQL 2000 754
SQL 2005 1011
SQL 2008 2397
SQL 2012 3885
SQL 2014 3887
SQL 2016 3887
SQL 2017 3955
SQL 2019 5508
SQL 2022 5508
Azure SQL Database 5540

As you can see, there has been quite an explosion. One reason for this is that when Microsoft introduces a new collation feature, it multiplies with the existing collations. For instance, the increase from SQL 2017 to SQL 2019 is entirely due to the introduction of the UTF‑8 collations.

This growth is almost entirely confined to the Windows collations. SQL 2000 had 74 SQL collations, whereas SQL 2022 has 77 official SQL collations and one more deprecated SQL collation. This is not very strange; the SQL collations are legacy collations, intended to support old systems that saw the light of day on early versions of SQL Server, so that they can continue to function without need to rebuild the database.

That said, of all these 5540 collations, there is one collation which is the primus inter pares, the king of the hill, the collation which is likely to be more widely used than any other collation, and that is an SQL collation: SQL_Latin1_General_CP1_CI_AS. This collation:

The collations returned by sys.fn_helpcollations are only the officially supported collations. As I mentioned, there is one deprecated SQL collation (which I determined by comparing the set of SQL collations in each version from SQL 2000 and up). There are also deprecated Windows collations. I was able to find these collations with the help of this query (which requires SQL 2022 or Azure SQL Database/Managed Instance due to the use of the function generate_series):

WITH all_collation_names(name) AS (
   SELECT DISTINCT convert(nvarchar(128), CollationPropertyFromId(value, 'Name')) AS name
   FROM   generate_series(1, power(2, 19))
)
SELECT name
FROM   all_collation_names
WHERE  collationproperty(name, 'Codepage') IS NOT NULL
EXCEPT
SELECT name  
FROM   sys.fn_helpcollations()
ORDER BY name

The function CollationPropertyFromId is undocumented, but I learned about it by running sp_helptext on sys.columns. It seems that the id it accepts as argument is some sort of a bit mask, and from that bit mask it constructs a collation name. It turns out that the same collation name can result from more than one input number, whence the DISTINCT. As for the upper limit of 2^19, I arrived at it by simple testing. Moving from 2^18 to 2^19 added more collations returned by the CTE all_collation_names, but moving to 2^20 did not.

Running the function as such with this range, I got back 16 930 distinct collation names, but I realised that this included a lot of non-existing collations. The filter on collationproperty serves to filter out these names. Finally, this result is reduced against sys.fn_helpcollations with the help of the EXCEPT operator. If you run the query, you will find that it returns 172 Windows collations. Why these collations are deprecated, I don't know for each and every case, but obviously there is something wrong with them technically – or maybe, in some cases, politically. In any case, for the rest of the article, with an exception or two, we will only work with the official collations returned by sys.fn_helpcollations.

In the next chapter, we will take a closer look at Windows collations and all the different properties associated with a Windows collation. This is a long chapter, and it is followed by a considerably shorter one that looks into the SQL collations in more detail.

The Anatomy of a Collation Name

In this chapter we will dissect the names of the Windows collations to understand what all those letter combinations mean and what effect they have. As a starting example, here are some collation names which include most of the possible components:

The names fall into these parts:

In the rest of the chapter, we will look into all these parts in more detail.

Collation Families

A collation family is a set of collations that share the same basic rules that are based on the rules of one or more languages. That is, the collation family defines:

There are 88 collation families which you can review with help of this query:

WITH FamVersions AS (
   SELECT DISTINCT substring(name, 1, len(name) -  6) AS name,
                   convert(int, collationproperty(name, 'CodePage')) AS CodePage
   FROM   sys.fn_helpcollations()
   WHERE  name LIKE '%[_]CI[_]AS'
     AND  name NOT LIKE 'SQL%'
)
SELECT DISTINCT 
       replace(replace(replace(replace(
           name, '_90', ''), '_100', ''), '_140', ''), '_160', '') AS Family,
       CodePage
FROM   FamVersions
ORDER  BY Family

If you look through the list, you will see that most collation families are for a single language. For some languages there is more than one family, with Chinese taking the lead with 11 families. (Although, as we shall learn at the end of this chapter, some of these are actually the same by a different definition of collation family, which I prefer to leave out for the moment.) There are several reasons why there is more than one family for a single language. For some languages there is more than one set of rules, and we will see some examples of this later in this section. There are also a couple of languages for which there are families with different code pages for varchar; typically, these are languages that can be written with both Latin and Cyrillic scripts.

There are a few examples of collations that support a pair of languages, for instance my favourite family: Finnish_Swedish.

Then there are three collation families that cover a broader set of languages:

Latin1_General
A number of languages, some of them very widely spoken, that use characters supported by code page 1252 and all have rules similar enough to fit into a single collation family: English, Indonesian, Portuguese, German, Italian, Dutch and Catalan are those that I can name off the top of my head, but there are more.
Cyrillic_General
Languages that use the Cyrillic script and have the same rules as Russian, which is the leading language in this group. As for other languages, I know that Bulgarian fits in, but for several others I have difficulties telling whether their rules fit with Cyrillic_General, or if they simply don't have a good collation of their own. (Some of the languages using Cyrillic script have their own collations: Ukrainian, Serbian and Kazakh to name a few.)
Indic_General
Many languages on the Indian subcontinent: Hindi, Punjabi, Marathi, Kannada, Telugu etc. (But not all, as testified by the fact that some languages from this area have their own collation families).

And now for some fun. Let's look at how different languages sort things differently. First, we will create a table sortwords and load it with some words. The no column gives the sorting order of the word column in English and other languages using Latin1_General and this column serves as a reference as the words move around when we sort by different collations. The remark column explains words and names that may not be known to everyone.

DROP TABLE IF EXISTS sortwords
CREATE TABLE sortwords (no     int           NOT NULL,
                        word   nvarchar(50)  NOT NULL,
                        remark nvarchar(100) NULL,
                        CONSTRAINT pk_sortwords PRIMARY KEY (no)
)

INSERT sortwords(no, word, remark)
   VALUES (1,  N'Aarhus',      'Second biggest city in Denmark'),
          (2,  N'ål',          'Swedish for "eel"'),
          (3,  N'centre',      NULL),
          (4,  N'chicos',      'Spanish for "children"'),
          (5,  N'csardas',     'Hungarian dance'),
          (6,  N'cycle',       NULL),
          (7,  N'Llanfair',    'Community in Wales'),
          (8,  N'lycklig',     'Swedish for "happy"'),
          (9,  N'ñandú',       'Rhea, an ostrich-like bird'),
          (10, N'Netherlands', NULL),
          (11, N'often',       NULL), 
          (12, N'Önnestad',    'Village where I grew up'),
          (13, N'Osnabrück',   'City in Germany'),
          (14, N'Pécs',        'City in Hungary'),
          (15, N'zuppa',       'Italian for "soup"')

If you simply run:

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Latin1_General_CI_AS

You will see that the numbers come in order 1-15.

Data in sortwords according to Latin1_General

Now we will sort the list by three other collations:

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Traditional_Spanish_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Welsh_100_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Modern_Spanish_CI_AS

These are the result sets:

Traditional_Spanish_CI_AS
sortwords sorted according to traditional Spanish
Welsh_100_CI_AS
sortwords sorted according to Welsh
Modern_Spanish_CI_AS
sortwords sorted according to modern Spanish

You can see that both for traditional Spanish and Welsh, chicos jumps down to come after cycle. And likewise for both traditional Spanish and Welsh, Llanfair and lycklig change places. This is because in Welsh the digraphs ch and ll are considered to be letters of their own in the alphabet, and this was also the case in the Spanish language until 1994. That is, if you would open a Welsh dictionary or an older Spanish dictionary you would find that there are separate sections for ch and ll. For readers who have never seen this before, this may seem appalling. However, it is very common to have digraphs and trigraphs to denote certain sounds in languages that use the Latin script. You have to look no further than English, where ch denotes the same sound as in Spanish (in Welsh ch is used for a completely different sound), and there is also sh and th. It is just that English does not consider these digraphs to be letters in the alphabet. Nor do Swedish, Portuguese or German consider their digraphs to be letters, to name a few. But Spanish and Welsh are by no means unique; there are several other languages like them, of which one is coming up in a second.

What possibly is unique is the change in Spanish. In 1994, Real Academia Española (RAE) stated that having ch and ll as separate letters in the alphabet would be too difficult for computers and mandated that ch would now sort between cg and ci and likewise ll was now to sort between lk and lm. But many other languages do like Welsh and jog along with their digraphs in their alphabet. And, as you can see, SQL Server is able to deal with it. SQL Server provides two collation families for Spanish. Modern_Spanish for the post-1994 version of the Spanish alphabet, and Traditional_Spanish for the old version.

There is one more deviation from Latin1_General we have not discussed yet. For both of the Spanish collations, ñandú and Netherlands have changed places compared to English. In Spanish ñ represents a specific sound and counts as a separate letter in the alphabet. This did not change in 1994.

Let's now look at three languages that all use the glyph ö to represent the same sound, but which handle the character in different ways.

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE German_Phonebook_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Hungarian_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Finnish_Swedish_CI_AS

Here is the output from these three queries:

German_Phonebook_CI_AS
sortwords sorted according to German_Phonebook
Hungarian_CI_AS
sortwords sorted according to Hungarian
Finnish_Swedish_CI_AS
sortwords sorted according to Finnish and Swedish

Let's start with looking at the German case. You can see that Önnestad has jumped to come before often. Why does this happen? Above we noted that ñ has a distinct pronunciation in Spanish, and it therefore counts as a letter of its own in the alphabet. But German takes a different stance. Ö has a separate pronunciation from o, but yet in a German dictionary o and ö co-sort. And the same is true for the pairs a / ä and u / ü. The net effect of this is that for German dictionary order, Latin1_General fits the bill, and there is no need for a German dictionary collation. However, the German phone books, when they existed, applied a different sorting rule. Instead of co-sorting äöü with a, o and u, they co-sorted them with ae, oe and ue. (To see this, you can add one more word to the list, for instance Odessa, as in the sketch below.) This is by no means illogical, as this was the original spelling of these vowels, and the dots are the result of the e's being reduced in the hand-writing of monks in the Middle Ages. Some German people use this spelling in their names rather than using the umlauted vowels, for instance Baerbock rather than Bärbock. Germans also appear to be prone to change their spelling when they move to areas where umlauts are rare. As one example, the composer Arnold Schönberg changed his last name to Schoenberg when he moved to the US.
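Here is a sketch of that check. I use the number 16, which is free in the table, and I delete the row again at the end so that the later examples in this section are not affected. With the phone-book collation, you would expect Odessa to land just before Önnestad, since ö is treated as oe:

INSERT sortwords(no, word, remark)
   VALUES (16, N'Odessa', 'City in Ukraine')

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE German_Phonebook_CI_AS

-- Clean up, so that the following examples show the original 15 words.
DELETE sortwords WHERE no = 16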

Next is Hungarian. Let's first note that while chicos stays in place, csardas moves to come after cycle. In Hungarian, ch has no special function, but instead cs counts as a letter of its own (with the same pronunciation as ch in Spanish and English.) When it comes to ö, we can see that Önnestad now comes between Osnabrück and Pécs. In Hungarian ö counts as a separate letter which comes after o in the alphabet. The same is true for ü, which comes after u. (Hungarian does not have ä.) Hungarian can also have accents on their vowels to mark that they are long, including the famous double accents for ő and ű, but the accented versions are not considered separate letters, but co-sort with the unaccented, short, vowels.

And finally we have Finnish and Swedish. As you can see, ål and Önnestad move to be at the end of the list. Swedish has three extra vowels that are not used in English: åäö. Like in Hungarian, they count as separate letters in our alphabet, but instead of having them after their lookalikes, we have placed them at the end of our alphabet in that order: åäö. (Finnish has followed this pattern, although Finnish does not really use å, only ä and ö.)

For the last set, I will poke some fun with my Scandinavian neighbours.

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Danish_Greenlandic_100_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Danish_Norwegian_CI_AS

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Norwegian_100_CI_AS
Danish_Greenlandic_100_CI_AS
sortwords sorted according to Danish and Greenlandic
Danish_Norwegian_CI_AS
sortwords sorted according to Danish and maybe Norwegian
Norwegian_100_CI_AS
sortwords sorted according to Norwegian

Just like Swedish, Danish and Norwegian have three extra vowels at the end of their alphabet: æøå in that order. As you can see, two of the glyphs are different from Swedish, but they represent the same sounds. The way it works is that ä = æ and ö = ø (in Swedish ø is known as a "Danish ö"). å is the same sound in all three languages.

So why are they in different order? Well, you see, here in Sweden we are a little more progressive. We introduced the letter å in the 15th century. The Norwegians did not come around to it until 1917. And the Danes lagged behind until 1948. For these two languages it was natural to put å at the end of the alphabet as they had had æ and ø there for a long time. This explains why Önnestad and ål have changed places. But how come Aarhus has moved from the beginning of the list to the end?

The origin for å is a long a-sound, originally spelled aa, and in hand-writing the second a evolved to be a ring above the first a. As noted above, Danish and Norwegian introduced this letter a lot later than Swedish. A consequence of this is that there are many proper names in Danish and Norwegian which still use the old spelling with aa rather than å and the city of Aarhus is one example. (Actually, when å was introduced in 1948, the city changed its spelling to Århus, but they reverted back to Aarhus in 2013.) For this reason, å and aa co-sort in the Danish collations.
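If you want to see the co-sorting for yourself, here is a small sketch of mine along the same lines as the Odessa example earlier. You would expect Århus to end up next to Aarhus at the end of the list:

INSERT sortwords(no, word, remark)
   VALUES (17, N'Århus', 'Aarhus with the post-1948 spelling')

SELECT no, word
FROM   sortwords
ORDER  BY word COLLATE Danish_Greenlandic_100_CI_AS

-- Clean up again, so that later examples are unaffected.
DELETE sortwords WHERE no = 17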

OK, but why are there two collations with Norwegian in the name with different results? Of the three collations above, Danish_Norwegian_CI_AS is the oldest one, introduced in SQL 7. The two collations with 100 in the name were introduced in SQL 2008, so it is more likely that they reflect the current state of affairs. So has Norwegian made a recent change, just like Spanish? It does not seem so. My understanding is that in Norwegian aa should sort as å if it represents that sound, but not if it is a long a from another language. Thus, Aarhus should indeed sort at the end, and so should a Norwegian person named Aaslund. But the German city of Aachen should sort in the beginning, and so should a Finnish person named Aaltonen. Getting a computer to get that right is not trivial. The Spanish Academy might have been right after all: digraphs are too difficult for computers.

We will do more sorting exercises later on when we look at case- and accent-sensitivity.

The Version Number

We will now move to the next component in the collation name, the version number. There are five possible values:

None
The original Windows collations introduced in SQL 2000. I will refer to these as version-80 collations. (As I have noted they first appeared as Unicode collations in SQL 7, so version-70 would maybe be more correct, but I've decided to go with 80.)
90
SQL 2005 added support for a few more languages, and these collations have 90 in their names. There were also new collations for a few families that already had version-80 collations.
100
SQL 2008 introduced support for even more languages, and also provided new collations for almost all of the existing collation families. All of these have 100 in the name.
140
SQL 2017 came with new collations with 140 in the name, for two Japanese collation families only.
160
These collations are only available in Azure SQL Database and Azure SQL Managed Instance. They are not available in SQL 2022. These collations are all for two Chinese collation families.

Now, SQL 2012 introduced the SC collations and SQL 2019 introduced the UTF8 collations, but there are no collations with 110 or 150 in the name. How come?

My conclusion is that these version numbers somehow relate to different versions of the Unicode standard, and that defines which characters are available and which characters SQL Server understands. I realise that this may not register with the reader right away, but we will look at some examples. You find the examples in this section in the script versionnumbers.sql in the zip file collationscripts.zip.

We will try this script (you may have the database from a previous demo, but I include the CREATE DATABASE statement just to make sure that it is there):

CREATE DATABASE Latin1_General_CI_AS COLLATE Latin1_General_CI_AS
go
USE Latin1_General_CI_AS
go
DECLARE @sun nvarchar(10) = N'słoǹce'
SELECT upper(@sun) AS Upper, 
       replace(@sun, N'ǹ', N'ń') AS Replace, 
       charindex(N'ǹ', @sun) AS Charindex,
       IIF(@sun = N'słoce', '=', '<>') AS IsEqual
Image of the script to the left. It features LOWERCASE N WITH GRAVE ACCENT

The script works with the Polish word słońce, which you may recall means sun. However, if you look closely, you can see that there is a misspelling in the script. It should be an n with an acute accent, U+0144, but in the script the accent is grave, which is character U+01F9.

The script attempts four things with this misspelling: change it to uppercase, replace the incorrect letter with the correct one, find the bad character, and compare it to słoce without any sort of n at all. The output is certainly puzzling:

Output from the query above.

All characters have been uppercased except for the n with the grave accent. replace does not have any effect and charindex is not able to find the character. And SQL Server suggests that the word is equal to the word without the n.

The reason for this is that in the version of Unicode that the version-80 collations are defined from, the code point U+01F9 had not yet been defined. So in the version-80 collations this is just a meaningless byte. When displaying characters, SSMS pays no regard to the database collation, but will display the character, as long as the font you have chosen is able to render it. With a modern font based on a later version of Unicode, it is displayed.

This explains very well why upper has no effect. SQL Server does not think of this code point as a letter, just some undefined nothing. The outcome of the equality test also makes some sense. After all, if the byte means nothing, why not simply ignore it when comparing? It is more mysterious why replace and charindex do not find the character. One would like to think that in this case, these functions would simply look for the raw byte value, but for one reason or another, SQL Server simply waves its hands and says "I don't understand this character" and gives up.
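One way to convince yourself that the value is still there, and that it is only the collation-aware functions that give up, is to ask for the numeric code point. The functions substring and unicode work positionally and do not involve any collation rules. This is an extra check of mine, not something you need for the discussion:

DECLARE @sun nvarchar(10) = N'słoǹce'
-- The bad character is in position 4; unicode() should return 505, which is 0x01F9.
SELECT unicode(substring(@sun, 4, 1)) AS codepoint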

All this changes if we run the same code in a database with a version-100 collation:

IF db_id('Latin1_General_100_CI_AS') IS NULL
   CREATE DATABASE Latin1_General_100_CI_AS COLLATE Latin1_General_100_CI_AS
go
USE Latin1_General_100_CI_AS
go
-- All results are now as expected.
DECLARE @sun nvarchar(10) = N'słoǹce'
SELECT upper(@sun) AS Upper, 
       replace(@sun, N'ǹ', N'ń') AS Replace, 
       charindex(N'ǹ', @sun) AS Charindex,
       IIF(@sun = N'słoce', '=', '<>') AS IsEqual

Now the output is exactly what you expect:

Output of the query above with in a database with a version-100 collation

All characters are uppercased, replace corrects the spelling, charindex finds the bad character, and the misspelling is not equal to słoce.

Note: As one example of where the character ǹ (n with grave accent) is used, there is Pinyin, that is, Chinese written with Latin letters.

We will look at another example where the version number comes into play and which is of particular interest if you work with the languages of the Indian subcontinent. We will start by creating a table:

CREATE TABLE langnames (englishname nvarchar(20), 
                        nativename  nvarchar(20) COLLATE Hindi_CI_AS)

As you can see, this is a version-80 collation; there is no version number. But there is one more thing to note with the collation Hindi_CI_AS: it is one of these deprecated collations. And, as we will see, there is a good reason why it is deprecated.

Next, we add the names of some languages to this table. In the column englishname, we have the names of the languages in English, and in the column nativename, we have the names of the languages written in their native script. All but the first language in the list are official languages of India.

INSERT langnames(englishname, nativename)
   VALUES(N'Hungarian', N'Magyar'),
         (N'Telugu',    N'తెలుగు'),
         (N'Hindi',     N'हिन्दी'),
         (N'Marathi',   N'मराठी'),
         (N'Bengali',   N'বাংলা'),
         (N'Kannada',   N'ಕನ್ನಡ')
INSERT script with names of languages in India and in the languages themselves

Note: I know very little about the Indian languages, and I took the values for the nativename column from Wikipedia.

Then we run this somewhat convoluted query:

SELECT i, englishname, substring(nativename, i, 1) AS undefined
FROM   langnames
CROSS  JOIN (VALUES(1), (2), (3), (4), (5), (6)) AS V(i)
WHERE  charindex(substring(nativename, i, 1), nativename) = 0
  AND  len(nativename) >= i
ORDER  BY englishname, i

What it does, you could say, is that it loops over the characters in the values in the nativename column, and then tries to find that character in the string with charindex. That is, we are searching a string for a character that we found in the string itself. Thus, from this perspective, it should not return any rows. But we have learnt that with undefined code points, SQL Server does not even try, and this is the result set:

Output from the query with undefined code points

Not one, but two of the characters in the Hindi name for Hindi are undefined code points in this collation. No wonder this collation is deprecated!

Look what happens when we change the collation to a version-90 collation:

ALTER TABLE langnames 
   ALTER COLUMN nativename nvarchar(20) COLLATE Kazakh_90_CI_AS

SELECT i, englishname, substring(nativename, i, 1) AS undefined
FROM   langnames
CROSS  JOIN (VALUES(1), (2), (3), (4), (5), (6)) AS V(i)
WHERE  charindex(substring(nativename, i, 1), nativename) = 0
  AND  len(nativename) >= i
ORDER  BY englishname, i

The result set is now empty. But what does Kazakh have to do with it? Absolutely nothing. I just took a version-90 collation with no relation to India to show that it is the version number that matters, not the collation family. Obviously, if you are to work with languages from India, the natural choice would be an Indic_General collation. (Or a Bengali or an Assamese collation, if you are specifically working with these languages.) But you should absolutely not work with a version-80 collation or an SQL collation. That is not going to work out well.

There is one more twist. Here is an example in Amharic that I picked up in a forum post. (Amharic is a major language in Ethiopia, of which I know nothing myself). Consider this query:

SELECT charindex(N'ይት' COLLATE Latin1_General_CI_AS, N'ሳተላይት'),
       charindex(N'ይት' COLLATE Latin1_General_100_CI_AS, N'ሳተላይት')

Query with strings in Amharic

This returns 1 and 4, respectively. If you look at the strings, 4 appears to be the correct value. Why do we get 1 and not 0 for the collation Latin1_General_CI_AS, when we have learnt that charindex gives up on undefined code points? This took me a while to figure out, but it seems that if you search for a string with an undefined code point in a string that starts with an undefined code point, you get a match, even if these two code points are different from each other. But if the string starts with a defined code point, there is no match. Look at this (note that the n with grave accent re-appears here):

SELECT charindex(N'ይት' COLLATE Latin1_General_CI_AS, N'ǹ'),
       charindex(N'ይት' COLLATE Latin1_General_CI_AS, N'ǹf'),
       charindex(N'ይት' COLLATE Latin1_General_CI_AS, N'fǹ')

Query searching for Amharic letters in a text of Latin letters.

This returns 1, 1 and 0 respectively. An extra spice with the first example is that the Amharic search string is two characters long, and yet we get a match in a string that is only one character long! While these results certainly are funny, the moral of the story is: use a collation based on a Unicode version that supports the language you are working with. (There are no Amharic collations, but since Amharic uses its own script, it is perfectly possible that Latin1_General_100 works, since there is no conflict between the sorting of the Latin letters and the Amharic letters.)

Finally, using the observation that charindex returns 0 when the code point is undefined, here is a query that returns how many characters in the Unicode base plane are defined in different collation versions. The query requires Azure SQL Database or Azure SQL Managed Instance due to the appearance of the version-160 collation:

SELECT SUM(charindex(nchar(value) COLLATE Czech_CI_AS, 
           'AA' + nchar(value) + 'AA')) / 3 AS "Version 80",
       SUM(charindex(nchar(value) COLLATE Tatar_90_CI_AS, 
           'AA' + nchar(value) + 'AA')) / 3 AS "Version 90",
       SUM(charindex(nchar(value) COLLATE Thai_100_CI_AS, 
           'AA' + nchar(value) + 'AA')) / 3 AS "Version 100",
       SUM(charindex(nchar(value) COLLATE Japanese_XJIS_140_CI_AS, 
           'AA' + nchar(value) + 'AA')) / 3 AS "Version 140",
       SUM(charindex(nchar(value) COLLATE Chinese_Simplified_Pinyin_160_CI_AI_SC_UTF8, 
           'AA' + nchar(value) + 'AA')) / 3 AS "Version 160",
      COUNT(*) AS "Total code points"
FROM  generate_series(0, 65535)
WHERE value NOT BETWEEN convert(int, 0xD000) AND convert(int, 0xDFFF)

The reason for the WHERE clause will be apparent when we discuss SC collations.

This is the output:

Version 80          42042
Version 90          48597
Version 100         55995
Version 140         58343
Version 160         58299
Total code points   61440

Note: The careful reader will note that the number of code points in version-160 is actually slightly lower than the number for version-140. I don't really have a good answer for this, but as I will note in a few places later in the article, it seems that version-160 was a development from version-100 rather than from version-140.

CI/CS – Case (In)Sensitivity

On the surface, this may seem trivial. Is INSERT equal to insert? If the collation is case-insensitive (CI) it is. If the collation is case-sensitive (CS), it is not. However, as the reader will find, there are a few more things hiding here that do not meet the eye directly.

We start with creating a table, continuing in the same database where the previous section left us:

USE Latin1_General_100_CI_AS
go
SELECT IIF('INSERT' = 'insert', 'Equal', 'Not equal')
go
CREATE TABLE MyTable(a int NOT NULL, b int NOT NULL)

We can see that the query indeed returns Equal.

This query runs without problems, despite the fact that we use different case for the table and column names:

SELECT A, B FROM mytable  

Whereas the attempt to create a table with the same name fails:

CREATE TABLE MYTABLE(a int NOT NULL, b int NOT NULL)

Msg 2714, Level 16, State 6, Line 8

There is already an object named 'MYTABLE' in the database.

This also ends with an error:

ALTER TABLE MyTable ADD A datetime

Msg 2705, Level 16, State 4, Line 14

Column names in each table must be unique. Column name 'A' in table 'MyTable' is specified more than once.

For anyone who has a background with Pascal, Visual Basic and many other environments, there is nothing strange here; everything is perfectly normal.

Let's now create another database, this time with a case-sensitive collation:

CREATE DATABASE Latin1_General_100_CS_AS COLLATE Latin1_General_100_CS_AS
go
USE Latin1_General_100_CS_AS
go
SELECT IIF('INSERT' = 'insert', 'Equal', 'Not equal')
go
CREATE TABLE MyTable(a int NOT NULL, b int NOT NULL)

The output from the SELECT query is now Not equal as we can expect. If we again try the query above and some variations, they all fail:

SELECT A, B FROM mytable 
go
SELECT A, B FROM MyTable
go
SELECT a, b FROM mytable

The output is:

Msg 208, Level 16, State 1, Line 10

Invalid object name 'mytable'.

Msg 207, Level 16, State 1, Line 12

Invalid column name 'A'.

Msg 207, Level 16, State 1, Line 12

Invalid column name 'B'.

Msg 208, Level 16, State 1, Line 14

Invalid object name 'mytable'.

Only if we write the column and table names exactly as given in the CREATE TABLE statement does the query execute successfully:

SELECT a, b FROM MyTable

Likewise, we can now run these CREATE and ALTER TABLE statements without error:

CREATE TABLE MYTABLE(a int NOT NULL, b int NOT NULL)

ALTER TABLE MyTable ADD A datetime

And if you have previously worked in environments like C++, C#, Python or Perl, this is perfectly normal to you. What isn't exactly normal is that whether identifiers are case-sensitive or case-insensitive depends on a setting.

Next, consider this script. What outcome do you get when you run it?

USE Latin1_General_100_CI_AS
go
DECLARE @A char(3) = 'AAA',
        @a char(3) = 'aaa'
SELECT IIF(@A = @a, 'Equal', 'Not equal')
go
USE Latin1_General_100_CS_AS
go
DECLARE @A char(3) = 'AAA',
        @a char(3) = 'aaa'
SELECT IIF(@A = @a, 'Equal', 'Not equal')

Since the default for the server collation is always a case-insensitive collation, most readers are likely to get this output:

Msg 134, Level 15, State 1, Line 5

The variable name '@a' has already been declared. Variable names must be unique within a query batch or stored procedure.

Msg 134, Level 15, State 1, Line 11

The variable name '@a' has already been declared. Variable names must be unique within a query batch or stored procedure.

Variable names follow the server collation, so it does not matter which database you are in. With a case-insensitive server collation, the batches above are illegal. Myself, I swear by case-sensitive, so all instances I install for my own use have a case-sensitive server collation. Thus, when I run the script above, I get no error message, but instead I get the result Equal and Not equal respectively. Because even if the names of the variables follow the server collation, the values of the variables follow the database collation. (This was something we discussed already in the first chapter, but since it is so confusing, I figured I should add a demo for it.)
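If you are ever unsure which collation applies where, you can always ask. Here is a minimal sketch; serverproperty gives you the collation that variable names follow, and databasepropertyex gives you the collation that the variable values follow:

SELECT serverproperty('Collation')                 AS server_collation,
       databasepropertyex(db_name(), 'Collation')  AS database_collation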

We will now create a third database, and again we use a case-insensitive collation:

CREATE DATABASE Turkish_CI_AS COLLATE Turkish_CI_AS
go
USE Turkish_CI_AS
go
SELECT IIF('INSERT' = 'insert', 'Equal', 'Not equal')

And the output is.... Not equal, what!?!? To understand what is going on, run this query:

SELECT lower('INSERT'), upper('insert')

This is the output:

The result of the lower/upper query. Note where the dot for the i is.

Look what happened with the dot on the lowercase i in insert. When all the lowercase letters moved to the left, the dot stayed and is now on top of the uppercase I in İNSERT while ınsert has a dotless i. You see, Turkish has two I letters in the alphabet. The dotted İ and the dotless I. Therefore, the upper/lowercase mappings are different than in most other languages. Before you freak out and think that this is completely stupid, permit me to point out two things:

  1. While the Turkish orthography is relatively new – Turkish was written with the Arabic script until the early 20th century – it is still older than the computer age.
  2. From a phonological point of view, it makes perfect sense. Turkish has three such pairs: ı/i, u/ü, o/ö, where the undotted vowels are back vowels and the dotted vowels are front vowels.

Turkish is not entirely unique; you get the same result with Azeri collations. Azeri is closely related to Turkish, and its Latin orthography is strongly inspired by Turkish.

Now that we know this, consider:

CREATE TABLE Inventory(a int NOT NULL)
go
SELECT * FROM inventory

As we can expect, this results in the error message

Msg 208, Level 16, State 1, Line 14

Invalid object name 'inventory'.

If you are writing software for an international market, this is something you need to be aware of. I will discuss different strategies in the upcoming chapter Collations Considerations for ISVs, but as a quick tip: develop with a case-sensitive collation. Or forget about the Turkish and Azeri markets. Trying to sort out the mess after the fact may be a gargantuan task.
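As a side note – and this only helps for data comparisons, not for identifier lookups like the Inventory example above – an explicit COLLATE clause overrides the Turkish rules for a single expression. A small sketch, run in the Turkish_CI_AS database:

SELECT IIF('INSERT' = 'insert' COLLATE Latin1_General_CI_AS, 'Equal', 'Not equal')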

I don't know of any other language with unusual upper/lower rules, but there might be more of them. But there is another surprise with case that you can run into. We will create one more database:

CREATE DATABASE Hungarian_CI_AS COLLATE Hungarian_CI_AS
go
USE Hungarian_CI_AS
go
SELECT IIF('INSERT' = 'insert', 'Equal', 'Not equal')

It prints Equal, so no surprises there. Next, we create a table:

CREATE TABLE tTable(cString varchar(20) NOT NULL)

You may recognise the naming convention here as Hungarian notation. That is, identifiers have a leading prefix in lowercase which details the type of the column (or variable or whatever), and the real name then follows with a leading uppercase. Personally, I have never been a fan of this notation, but it is certainly popular.

Now we write a query against this table. But this mixed-case in the middle of a name is a little difficult, so being lazy, we produce:

SELECT cstring FROM ttable

After all, it's a case-insensitive database, so that should work. But:

Msg 207, Level 16, State 1, Line 18

Invalid column name 'cstring'.

You may recall from the sorting examples in the section on collation families that in Hungarian the digraph cs counts as a separate letter in the alphabet. And that is the crux of the biscuit here. When you have a digraph like this, all of these make sense: CSARDAS, CSardas, Csardas, csardas and they should be considered equal with case-insensitive rules. But cSardas? Or cString? That does not make sense, or so SQL Server thinks and draws the conclusion that this cannot be the letter cs, but it must be c + s. And, certainly that conclusion is correct in this example! What SQL Server is not able to figure out is that the cs in cstring is not the same as the cs in csardas. Don't be too hard on it. It's only a computer.

Here are some more examples you can try:

SELECT IIF('cstring' = 'cstring', 'Equal', 'Not equal'),
       IIF('cstring' = 'Cstring', 'Equal', 'Not equal'), 
       IIF('cstring' = 'CString', 'Equal', 'Not equal'),
       IIF('cstring' = 'cString', 'Equal', 'Not equal')
UNION ALL
SELECT IIF('CSTRING' = 'cstring', 'Equal', 'Not equal'),
       IIF('CSTRING' = 'Cstring', 'Equal', 'Not equal'), 
       IIF('CSTRING' = 'CString', 'Equal', 'Not equal'),
       IIF('CSTRING' = 'cString', 'Equal', 'Not equal')
UNION ALL        
SELECT IIF('acstring' = 'acstring', 'Equal', 'Not equal'),
       IIF('acstring' = 'aCstring', 'Equal', 'Not equal'), 
       IIF('acstring' = 'aCString', 'Equal', 'Not equal'),
       IIF('acstring' = 'acString', 'Equal', 'Not equal')
UNION ALL
SELECT IIF('ACSTRING' = 'Acstring', 'Equal', 'Not equal'),
       IIF('ACSTRING' = 'ACstring', 'Equal', 'Not equal'), 
       IIF('ACSTRING' = 'ACString', 'Equal', 'Not equal'),
       IIF('ACSTRING' = 'AcString', 'Equal', 'Not equal')

You will find that the first three columns consistently result in Equal and the fourth consistently results in Not Equal.

As the latter examples show, this rule applies to both data and metadata, but I used metadata for the example, as the pattern with mixed case in the middle of a word is typical for identifiers. While you may be more prone to run into this if you adhere to Hungarian notation, that does not mean that you are safe if you don't use that style. Consider:

SELECT IIF('nextyear' = 'NextYear', 'Equal', 'Not equal')

This prints Not equal. TY is another digraph in the Hungarian alphabet.

This is by no means unique to Hungarian; the same applies to ch in Traditional Spanish and Welsh, aa in Danish, as well as digraph letters in other languages – there are a few more of them.

What may be unique to Hungarian is that it also sports a trigraph letter, dzs (pronounced like j in English). Look at this example:

SELECT IIF('dzsungel' = 'dzsungel', 'Equal', 'Not equal'), -- Equal
       IIF('dzsungel' = 'Dzsungel', 'Equal', 'Not equal'), -- Equal
       IIF('dzsungel' = 'DZsungel', 'Equal', 'Not equal'), -- Not equal
       IIF('dzsungel' = 'DZSungel', 'Equal', 'Not equal'), -- Equal
       IIF('dzsungel' = 'dZSungel', 'Equal', 'Not equal'), -- Not equal
       IIF('dzsungel' = 'dzSungel', 'Equal', 'Not equal'), -- Not equal
       IIF('dzsungel' = 'DzSungel', 'Equal', 'Not equal'), -- Not equal
       IIF('dzsungel' = 'dZsungel', 'Equal', 'Not equal')  -- Not equal

As you can see, this leads to even more cases where strings are not equal in a case-insensitive Hungarian collation. (Interestingly enough, my Hungarian dictionary only lists four words starting in dzs, and they are all loanwords from English. You may recognise dzsungel as jungle.)

One more note about Hungarian before we move on. There are two collation families for Hungarian. Beside the one simply named Hungarian, there is also Hungarian_Technical. I don't know the full story here, but when testing Hungarian_Technical_CI_AS it seems that it does not care about the digraphs, so it sorts csardas between chicos and cycle, and it considers cstring to be equal to cString. It still sorts ö between o and p.
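If you want to verify these observations for yourself, here is a small sketch of mine that contrasts the two Hungarian families, based on the claims above:

SELECT IIF('csardas' COLLATE Hungarian_Technical_CI_AS < 'cycle', 'before cycle', 'after cycle') AS technical,
       IIF('csardas' COLLATE Hungarian_CI_AS           < 'cycle', 'before cycle', 'after cycle') AS plain_hungarian,
       IIF('cstring' COLLATE Hungarian_Technical_CI_AS = 'cString', 'Equal', 'Not equal')        AS digraph_check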

We will now leave the case-insensitive surprises behind. Instead, we will take a closer look at how CI and CS affect sorting. Let's return to a case-insensitive database for English:

USE Latin1_General_100_CI_AS

Then we create a table that we populate with data:

DROP TABLE IF EXISTS CaseSorting
CREATE TABLE CaseSorting (no   int         NOT NULL PRIMARY KEY,
                          word varchar(30) NOT NULL, 
                          data int         NOT NULL)
INSERT CaseSorting (no, word, data)
   VALUES ( 1, 'Andrew',      9), 
          ( 2, 'ascertain',  18), 
          ( 3, 'Australia',  76),
          ( 4, 'Mixed+Case', 99),
          ( 5, 'MixedCandy', 75),
          ( 6, 'Mixedcase',   6), 
          ( 7, 'MIXEDCASE',  13), 
          ( 8, 'mixedcase',  37), 
          ( 9, 'MixedCase',  58),
          (10, 'Mixed-Case', 46)

Let's then look at how SQL Server sorts this data:

SELECT no, word, data
FROM   CaseSorting
ORDER  BY word, data

This is the output:

The data in the table CaseSorting sorted according to the database collation.

The values appear in the order they were entered in the table. As we can expect with case-insensitive sorting, the order of the variations of mixedcase depends on the value in the data column. Well, almost. Mixed+Case comes first, and even before MixedCandy. But this is not really strange. Generally, punctuation characters sort before digits and letters. But why then, does Mixed-Case with a hyphen come last? The hyphen is handled in a special way. If we take English as an example, there are words that can be spelled with a hyphen, but don't have to be. One such example is cooperate which also can be spelt co-operate. So from a user perspective, you don't want the hyphen to matter that much, and therefore it only has a secondary weight, and it only matters when nothing else differs.

Let's now sort the data by a case-sensitive collation:

SELECT no, word, data
FROM   CaseSorting
ORDER  BY word COLLATE Latin1_General_100_CS_AS, data

Before you look at the result, what do you expect?

Here is the result:

The data in the table CaseSorting sorted with a case-sensitive collation

Let's first consider the first three rows. The order is still Andrew, ascertain, Australia. So if you expected all lowercase to sort before or after all uppercase, you were wrong. Let's now look at all the variations of mixedcase. Mixed+Case still comes before MixedCandy, because the plus character sorts before all letters. But the other five now line up in a way that causes the data column to be out of order, so obviously lower/uppercase affects the sorting after all – but it only carries a secondary weight. That is, the case of a letter only matters if nothing else differs, and if so, the rule is that lowercase sorts before uppercase. When it comes to the hyphen, it is demoted one more level down, and it only matters when there is no other difference, not even in case.

AI/AS – Accent (In)Sensitivity

We will now move over to accent-sensitivity. It seems to me that accent-sensitive collations are more widely used than accent-insensitive, probably because this is the default with many system locales (but as we shall learn in the section Selecting the Server Collation at Setup, not all). In an accent-sensitive collation, the two English words resume and résumé are not equal to each other, but in an accent-insensitive collation, they are. (And since many English speakers tend to leave out the accents from résumé, maybe an accent-insensitive collation would be preferable for English!)

Just like case-insensitivity can take you off-guard when you move between different languages, so can accent-insensitivity. But let's start out relatively easy. This example should not be too startling, since we have looked at these languages before:

; WITH wordpairs AS (
    SELECT a, b 
    FROM   (VALUES(N'resume',    N'résumé'),
                  (N'słońce',    N'slonce'),
                  (N'cooperate', N'coöperate')) AS w(A, B)
)
SELECT a, b, 
       IIF(a COLLATE Latin1_General_CI_AI  = b, '=', '<>') AS English,
       IIF(a COLLATE Polish_CI_AI          = b, '=', '<>') AS Polish,
       IIF(a COLLATE Finnish_Swedish_CI_AI = b, '=', '<>') AS Swedish
FROM   wordpairs

As we learnt in the section on Collation Conflicts, coercing the collation for one term in an expression, coerces the collation for all terms in the expression. Thus, we only need to force the collation on one side.

This is the output:

Output of the query above. For the first the outcome is = for all three. For the second, Polish has <> and for the third Swedish has <>.

We have already discussed the first pair, and the outcome is the same in all languages. Whether é is used at all in Polish, I don't know, but it is used in Swedish and English to mark stress (or just that the word is a fancy loan from French), and it does not count as a letter on its own in any of the languages.

Next, we have the Polish word słońce compared to slonce, which is a non-word as far as a Pole is concerned. In Polish both ł and ń are considered separate letters, so the two are different also in an accent-insensitive collation. On the other hand, English and Swedish people look at that slash on the l and the accent on the n and think that's just flea crap and consider the two words to be the same in an accent-insensitive collation.

The last pair are two variations of the English word cooperate, which some people prefer to write differently to mark that the two o's are to be pronounced separately. One variant, which we have already mentioned, is co-operate. A second variant is to put a diaeresis over the second o. (This is not very common, but I've read two books where the authors used the spelling coöperate throughout.) Since this is only a pronunciation guide, it should be considered an accent, and English and Polish get it right. But it all breaks down with the Finnish-Swedish collation, since the glyph ö counts as a separate letter in these languages.

Let's look more at the letter ö. Björn is a name that is common in German and even more so in Swedish and Norwegian. Here are some variations of that name:

; WITH wordpairs AS (
    SELECT a, b 
    FROM   (VALUES(N'Björn', N'Bjorn'),
                  (N'Björn', N'Bjoern'),
                  (N'Björn', N'Bjørn')) AS w(A, B)
)
SELECT a, b, 
       IIF(a COLLATE Latin1_General_CI_AI   = b, '=', '<>') AS German_dict,
       IIF(a COLLATE German_Phonebook_CI_AI = b, '=', '<>') AS German_phone,
       IIF(a COLLATE Norwegian_100_CI_AI    = b, '=', '<>') AS Norwegian
FROM   wordpairs

This is the output:

Result of accent-insensitive comparisons of Björn, Bjorn, Bjoern and Bjørn.

As we discussed earlier, ö does not count as a letter of its own in German, and in dictionary order (that is Latin1_General) ö co-sorts with o. Thus, it's logical that they are considered equal in accent-insensitive comparison. The Norwegian (and Danish) letter ø is also considered to be a variation of o. In German phone-book sorting, as you may recall, ö co-sorts with oe. And this is mirrored in the accent-insensitive comparison: ö is no longer equal to o, but it is equal to oe. But apparently ø is not handled the same way as ö in German phone-book sorting, although the pronunciation is the same. And what about Norwegian? Norwegian does not use ö, but it has ø, with the same pronunciation. Norwegian considers ö to be an "accented" form of ø, and therefore it is distinct from o, even in accent-insensitive comparison.

This example shows that what is an accent is not always as simple as a small thing being put over, across or below a letter. Generally, we can say this: if two characters or a sequence of characters co-sort, that is, they don't have distinct places in the alphabet, they will compare as equal in accent-insensitive collation.

And now for two examples that may really blow you away:

; WITH wordpairs AS (
    SELECT a, b 
    FROM   (VALUES(N'Müller', N'myller'),
                  (N'vine',   N'wine')) AS w(a, b)
)
SELECT a, b, 
       IIF(a COLLATE Latin1_General_CI_AI = b,  '=', '<>') AS English,
       IIF(a COLLATE Polish_CI_AI = b,          '=', '<>') AS Polish,
       IIF(a COLLATE Finnish_Swedish_CI_AI = b, '=', '<>') AS Swedish
FROM   wordpairs

Here is the output:

Output of the query above. In English and Polish they are different, but in Swedish they are equal.

English and Polish agree that both word pairs are different, but Swedish reports a dissenting opinion. For the first pair, you may ask how two completely different glyphs can be "accents" of each other? The answer is that in Swedish, ü is known as "German y". There are very few words with ü in Swedish, but it appears in proper names of German origin. (In this example, Müller is a German name, while myller is a Swedish word meaning throng.)

Then we have the second example with the two distinct English words vine and wine, which Swedish gladly considers to be equal. An English speaker may take exception and think: how can two letters that are pronounced differently be considered equal? But far from all languages use both v and w, or make a distinction between them. Polish uses only w, but still considers it to be different from v. Maybe v is so rare in Polish that it does not really matter. Swedish, on the other hand, uses both, but there is absolutely no distinction between them in pronunciation. For us, w is just a fancy way to write v. And if w appears in an abbreviation, we typically do not say dubbelve. We smile when we hear English people go on at length saying double-u double-u double-u. Ourselves, we make it quick with ve-ve-ve.

Note: Swedish is actually mending its ways. When I grew up I learnt an alphabet of 28 letters (= The English alphabet minus W + ÅÄÖ.) In 2006, the Swedish Academy issued a new edition of their normative dictionary for the Swedish language in which W had an entry of its own, and the 2016 edition retained this. That would mean that the Swedish alphabet now has 29 letters. Microsoft has yet to catch up and provide a Modern_Swedish collation, though. I have put up an enhancement request for this.

There is a point in having accent-sensitivity, since accents are required for proper spelling. As I said, in Swedish w is just a fancy way of writing v, but this does not mean that you can use either of them on a whim. If you write wete (wheat in English) or vatt (the unit watt), those are spelling errors; the correct spellings are vete and watt. But there are also cases where two glyphs are entirely interchangeable, because the language permits both. One such example is offered by German, where ss can be written with a "scharfes s", ß. Consider this example:

; WITH wordpairs AS (
    SELECT a, b 
    FROM   (VALUES(N'Strasse', N'Straße')) AS w(a, b)
)
SELECT a, b, 
       IIF(a COLLATE Latin1_General_CI_AI   = b, '=', '<>') AS DictCIAI,
       IIF(a COLLATE Latin1_General_CS_AS   = b, '=', '<>') AS DictCSAS,
       IIF(a COLLATE German_Phonebook_CI_AI = b, '=', '<>') AS PhoneCIAI,
       IIF(a COLLATE German_Phonebook_CS_AS = b, '=', '<>') AS PhoneCSAS
FROM   wordpairs

This is the output:

Output of the query above.

Note: My understanding is that you can always replace ß with ss in German, but the reverse does not apply. Eßen and außehen are nothing but misspellings. But that is too advanced for SQL Server to keep track of.

Let's now turn to how sorting works with an accent-sensitive collation. Here is a table with some sample words, all taken from French, with the exception of cõte that I made up for the sake of the example.

DROP TABLE IF EXISTS AccentSorting
CREATE TABLE AccentSorting (no   int          NOT NULL PRIMARY KEY,
                            word nvarchar(30) NOT NULL, 
                            data int          NOT NULL)
INSERT AccentSorting (no, word, data)
   VALUES ( 1, 'claque',      9), 
          ( 2, 'costume',    18), 
          ( 3, 'cote',       76),
          ( 4, 'coté',       99),
          ( 5, 'côte',       75),
          ( 6, 'côté',        6), 
          ( 7, 'cõte',       22),
          ( 8, 'coteau',     13), 
          ( 9, 'côtelé',     47),
          (10, 'craindre',   37)

Just to make it clear, in case your font is small: words 5 and 6 both have a circumflex on top of the o, while word 7 has a tilde.

We sort them by the "standard" collation Latin1_General_100_CI_AS:

SELECT no, word, data 
FROM   AccentSorting
ORDER  BY word COLLATE Latin1_General_100_CI_AS, data

We see this:

Output from sorting AccentSorting by Latin1_General_100_CI_AS

That is, the sorting of the words follows the no column. As you might have guessed, accents carry a secondary weight, so they only matter when nothing else differs. In this secondary sorting, "no accent" comes first, and the later in the word the difference in accent appears, the less significant it is. That is, the same principles as when we sort letters. Of the accents in this example the circumflex sorts before the tilde. Presumably, there is a "standard" sorting for Unicode that applies, unless the language in question has its own rules. I will have to admit that this is nothing I have investigated in detail.

Let's also sort the data with an accent-insensitive collation:

SELECT no, word, data 
FROM   AccentSorting
ORDER  BY word COLLATE Latin1_General_100_CI_AI, data

Here is the output:

The table AccentSorting sorted by Latin1_General_100_CI_AI

As we could expect the five variations of cote are now sorted according to the data column, as the five variations themselves have the same sort value when accents are not significant at all.

Above, I suggested that there is a standard sorting for accents, so we can expect to get the same result with many accent-sensitive collations. But there are exceptions. Since it is a list of French words, let's look at French:

SELECT no, word, data 
FROM   AccentSorting
ORDER  BY word COLLATE French_100_CI_AS, data

The output is not the same as for Latin1_General_100_CI_AS:

The table AccentSorting sorted according to French accent-sensitive rules

French uses accents more than many other languages, and it uses a different principle for the position of the accent. They start from the end of the word when looking for accents, so all cot-words that end with an unaccented e sort before those with é.

Just to reinforce the point that what is considered to be an accent depends on the language, here is the list of words sorted with an Estonian collation:

SELECT no, word, data 
FROM   AccentSorting
ORDER  BY word COLLATE Estonian_100_CI_AS, data

This is the output:

The text in AccentSorting sorted according to Estonian

Here cõte jumps to the end, since õ is a letter of its own in the Estonian alphabet. Estonian uses the same principle as the Nordic languages for its "extra" letters and puts them at the end of the alphabet. (Actually, the Estonian alphabet has a few more variations where some characters that are foreign to Estonian are shuffled around from the regular order, but I will spare you from those details.)

Later on, we will see an example of accents in a place where you might not expect it.

KS – Kana Sensitivity

The next sensitivity item is kana-sensitivity. This is a lot more specific, as it only relates to Japanese. Japanese can be written with a multitude of scripts. There are the ideographic kanji characters borrowed from Chinese and there is rōmaji, that is Latin letters. There are also two scripts, hiragana and katakana, that are native to Japan. Both these scripts are syllabic. That is, each character represents a syllable. For a speaker of English or Swedish where there can be several consonants before and after the core vowel of a syllable, this may sound complicated, but the Japanese syllable formula is a lot simpler, and these two scripts consist of 43 symbols each.

As I understand it, hiragana is mainly preferred for native words, while katakana is mainly used for words of foreign origin. Exactly how strict that rule is, I don't know, but let me put it this way: In English and other European languages, we have rules for when to use uppercase and when to use lowercase. Still, most of us prefer to have our searches case-insensitive. And likewise, Japanese people may prefer to have their searches be kana-insensitive, that is, if the user enters the hiragana symbol for ku, く, he or she also wants a hit if the text includes the katakana variant ク, which also represents the syllable ku.

This is also what you get in a kana-insensitive collation. But there is no KI marker, instead a collation is kana-insensitive as long as there is no KS marker (and the collation is not binary). But just like we sometimes want an exact match on case and use a CS collation, you may want an exact match on hiragana and katakana, and that is when you use a KS, kana-sensitive, collation.

In this example, we have Nakameguro (which you may recall is a district in Tokyo) written both in hiragana and katakana. Paris also appears, for reasons that will become apparent in the next section. You also find this script in the file kanawidth.sql in the collationscripts.zip archive.

DROP TABLE IF EXISTS Nakameguro
CREATE TABLE Nakameguro(
             KanaInsensitive nvarchar(20) COLLATE Japanese_CI_AS    NOT NULL,
             KanaSensitive   nvarchar(20) COLLATE Japanese_CI_AS_KS NOT NULL)
INSERT Nakameguro(KanaInsensitive, KanaSensitive)
   VALUES('なかめぐろ', 'なかめぐろ'),  -- Hiragana.
         ('ナカメグロ', 'ナカメグロ'),   -- Katakana.
         ('Paris',     'Paris')
Picture of query with words in hiragana and katakana

We run this query over the table:

SELECT COUNT(DISTINCT KanaInsensitive) AS KI, COUNT(DISTINCT KanaSensitive) AS KS
FROM   Nakameguro

We see in the output that the KI column returns 2, and the KS column returns 3. That is, in the KS column all strings are distinct, but in the KI column, two strings are considered to be equal to each other. (And Paris is not one of them!)

It is worth noting that while this example uses a Japanese collation, and kana is very much a thing only for Japanese, all collation families have both kana-insensitive and kana-sensitive collations. For instance, Mapudungan_100_CI_AS_KS is a perfectly legal collation. (Mapudungan or Mapuche is a Native American language, mainly spoken in Chile.)
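
Should you want to see that for yourself, here is a small sketch of my own, comparing hiragana and katakana ku with the Mapudungan family:

SELECT IIF(N'く' COLLATE Mapudungan_100_CI_AS    = N'ク', '=', '<>') AS KanaInsensitive,
       IIF(N'く' COLLATE Mapudungan_100_CI_AS_KS = N'ク', '=', '<>') AS KanaSensitive
-- I would expect = in the first column and <> in the second.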

WS – Width Sensitivity

Let's now turn to width sensitivity, which is also primarily an East-Asian thing, although the prime targets are Latin letters.

You may note in the script in the previous section that although Paris and Nakameguro in either of the two kana scripts are five characters, the strings with the kana symbols are about twice as wide. And this is not unique to the kana symbols; it is also true for the ideographic characters used by Chinese and Japanese as well as the hangul syllable blocks of Korean. Typically, an East-Asian character fits into a square, whereas a Latin character fits into a rectangle where the base is half the size of the height.

When the first character sets for East-Asian languages were designed in the 1970s or so, computing was dominated by fixed-width, monospaced fonts, which emphasised the shape difference between Latin and East-Asian characters. For a mix of aesthetical and technical reasons, some of the East-Asian character sets added two-byte versions of the ASCII letters and digits to match the East-Asian characters. When printed, these two-byte versions also had the same width as the East-Asian characters. For related reasons, going in the opposite direction, one of the Japanese standards introduced katakana characters that took up a single byte and a single rectangle.

An important requirement when Unicode was designed was round-trip compatibility with legacy character sets, which meant that if a character existed in a legacy character set, it had to be added to Unicode. This led to Unicode having a block of halfwidth characters (narrow versions of East-Asian characters) and a block of fullwidth characters (wide versions of Latin characters). If you want to know more about this concept, see the articles Halfwidth and fullwidth forms and Halfwidth and Fullwidth Forms (Unicode block) in Wikipedia.
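
As a quick illustration of my own: the fullwidth capital A is a character in its own right, with a code point in the fullwidth block, distinct from the ASCII letter:

SELECT unicode(N'A') AS regular_A, unicode(N'Ａ') AS fullwidth_A
-- Returns 65 and 65313 (that is, U+FF21).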

It goes without saying that normally you want a search to disregard whether a character is halfwidth or fullwidth, but apparently there are situations when you want to discern them, so there are collations for both cases. A collation is width-insensitive as long as there is no WS in the collation name (and it is not a binary collation). Add WS and it is width-sensitive. Just as for kana-sensitivity, you can do this for any collation family, not only East Asian ones. For instance, Norwegian_100_CI_AS_WS is a legal collation name.

Let's have a look at a demo. (You also find this script in the file kanawidth.sql in the file collationscripts.zip, the same as we used in the previous section.) To the left two fullwidth strings, Paris and Nakameguro in katakana, and to the right the same names in halfwidth. (That is the intention; how well this actually displays in your browser is another matter, but also see the screenshot below the script.)

; WITH Widthdemo AS (
   SELECT fullwidth, halfwidth     
   FROM  (VALUES (N'Ｐａｒｉｓ', N'Paris'), 
                 (N'ナカメグロ', N'ﾅｶﾒｸﾛ')) AS V(fullwidth, halfwidth) 
)
SELECT fullwidth, halfwidth, 
       len(fullwidth) AS len_fullwidth, len(halfwidth) AS len_halfwidth, 
       IIF(fullwidth COLLATE Japanese_CI_AI =
           halfwidth COLLATE Japanese_CI_AI, 'Equal', 'Different') AS Insensitive,
       IIF(fullwidth COLLATE Japanese_CI_AI_WS = 
           halfwidth COLLATE Japanese_CI_AI_WS, 'Equal', 'Different') AS WS
FROM   Widthdemo

Screenshot from SSMS of the WS query above.

Besides the comparisons, I've also added the result of the len function to the result set, so that you can see that all strings are indeed five characters.

Here is the output:

Result set of the WS query

The distinction between fullwidth and halfwidth is not very good in the results grid for Paris. As you can see, the fullwidth version is actually somewhat narrower than the halfwidth version. This is somewhat mysterious, since for the screenshots in this article, I'm using Consolas 12 pt for both the text editor and grid results.

If you look closely at this example, you will note that I am using accent-insensitive collations. That is not just on a whim, but there is a reason for it. If you switch to an AS collation, you will find that the fullwidth and halfwidth versions of Nakameguro in katakana are no longer equal. In fact, the halfwidth version does not say Nakameguro, but Nakamekuro. I had to do it that way, because there is no halfwidth version of the gu syllable, there is only ku. The Western reader may ask, but how can gu be equal to ku, even if the collation is accent-insensitive? They can't be accented forms of each other, can they?

Yes, they can. Here are inflated versions of the ku and gu characters in both hiragana and katakana:

The ku and gu characters in both hiragana and katakana

While the hiragana and katakana glyphs are considerably different from each other, the difference between ku and gu is the same in both kanas: there is a small glyph, which looks like a double quote to the Western eye, on the gu character. This glyph is known as a dakuten, and it is the equivalent of an accent, as we can tell from the example. For a person with English or French as their native language, the very idea of distinguishing between k and g only by means of an accent may seem corny. But the two sounds are articulated in the same place in the mouth, and both are stops. The only difference is in voicing, so why could that not be an accent?

This is not unique to the pair of ku and gu; all voiced/unvoiced pairs are distinguished with a dakuten (or a handakuten) in the two kanas.
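
Here is a small sketch of my own that shows the dakuten being treated as an accent; the hiragana ku and gu compare as equal only when the collation is accent-insensitive:

SELECT IIF(N'く' COLLATE Japanese_CI_AI = N'ぐ', '=', '<>') AS AccentInsensitive,
       IIF(N'く' COLLATE Japanese_CI_AS = N'ぐ', '=', '<>') AS AccentSensitive
-- I would expect = in the first column and <> in the second.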

VSS – Variation-Selector Sensitivity

As I have already mentioned, the VSS collations were introduced in SQL 2017 with the version-140 collations that are only for Japanese. So variation selectors are another thing that is unique to Japanese? Well, yes and no. Variation selectors can apply to more than just Japanese. But variation-selector sensitivity only seems to apply to Japanese characters.

Variation selectors are a fairly specialised topic that not all readers may be interested in. Also, it takes a bit of space to explain. For this reason, I have deferred this topic to a section in a later chapter, where we will also learn about another Unicode modifier.

BIN/BIN2 – Binary Collations

We now turn to the binary collations. In binary collations, all comparisons are by the code point. This means that these collations are always case-, accent-, kana- and everything else-sensitive. It also means that Zagreb sorts before ant, and zone sorts before Århus regardless of the collation family. This is not really aimed at being user-friendly, so binary collations may not be your first choice. However, there are some situations where binary collations can come in handy, and we will look at a few tricks in this section. Later in the article, we will look at how binary collations can help to improve performance.
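
To see the code-point ordering in action, here is a minimal sketch of my own with the words mentioned above:

SELECT w
FROM   (VALUES (N'ant'), (N'Zagreb'), (N'zone'), (N'Århus')) AS t(w)
ORDER  BY w COLLATE Latin1_General_100_BIN2
-- Z is code point 90, a is 97, z is 122 and Å is 197,
-- so the order becomes Zagreb, ant, zone, Århus.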

In the name of a binary collation, there are no CI/CS, AI/AS etc. Instead the version number (if there is any) is directly followed by _BIN or _BIN2, for instance Latin1_General_100_BIN and Latin1_General_100_BIN2. For legacy reasons, there are two styles of binary collations: BIN and BIN2. To understand the difference between them, we will work through some queries that you also find in the file binarycollations.sql in the collationscripts.zip file. We start with creating a table to which we add some characters used in Croatian:

USE Latin1_General_CI_AS
go
-- Create a table with some two-character sequences.
CREATE TABLE binarytest (BIN2_col    nchar(3) COLLATE Croatian_BIN2 NOT NULL,
                         BIN_col     nchar(3) COLLATE Croatian_BIN  NOT NULL, 
                         binarycol   binary(6) NOT NULL)
go
INSERT binarytest(BIN2_col, BIN_col, binarycol)
   VALUES(N'ABB', N'ABB', convert(binary(6), N'ABB')), 
         (N'AĆĆ', N'AĆĆ', convert(binary(6), N'AĆĆ')), 
         (N'ĐĆĆ', N'ĐĆĆ', convert(binary(6), N'ĐĆĆ'))
Picture of the VALUES clause just above.

The table has three columns, two with the different types of binary collations and the third is of the binary data type. We insert the same strings to all columns. To understand what we are about to see, we need to know the numeric values of the code points, and we inspect them both in decimal and in hex:

SELECT unicode(N'A') AS A, 
       unicode(N'B') AS B, 
       unicode(N'Ć') AS Ć,
       unicode(N'Đ') AS Đ 

SELECT convert(binary(2), unicode(N'A')) AS A, 
       convert(binary(2), unicode(N'B')) AS B, 
       convert(binary(2), unicode(N'Ć')) AS Ć,
       convert(binary(2), unicode(N'Đ')) AS Đ 

(Optimistically, I assume that the Croatian characters will appear correctly for most users, and therefore I do not include images of remaining queries in the section.)

This is the output:

Output of the queries above. The decimal values are 65, 66, 262 and 272.

From this we can expect that when we sort with a binary collation, the values should come in the order they appear in the INSERT statement. We are also inclined to expect the same to happen if we sort by binarycol:

SELECT *
FROM   binarytest
ORDER  BY binarycol

But that is not what we get:

Output of the query above. The values come in the order ĐĆĆ, AĆĆ, ABB

If we look at the values in the binarycol column, we can understand why we got that order. For every character, the bytes have been swapped so that the second byte comes first. But why does that happen?

The answer has nothing to do with collations, but is related to the matter of endianness, which I discussed earlier in the section Enter Unicode. The Wintel architecture, on which SQL Server is founded, is little-endian, that is, the bytes in a word are stored in reverse order, hence the output above.

Note: You may ask, what happens if we are running SQL Server on a Linux machine with a big-endian architecture. I don't have access to such a machine, so I have not tested. But my assumption is that the result is the same, since otherwise there would be quite severe compatibility issues when moving a database from one architecture to another.

If we instead sort by the column with the collation Croatian_BIN2, the scene changes:

SELECT *
FROM   binarytest
ORDER  BY BIN2_col

Now we get the expected order:

Output of the query with ORDER BY BIN2_col.

But that still leaves the question, what is the difference between BIN and BIN2? Well, let's try it:

SELECT *
FROM   binarytest
ORDER  BY BIN_col

The output is indeed different from the other two:

Output from the query with ORDER BY BIN_col

Can you see what is going on? Don't feel bad if you don't, because it is quite crazy: they swap the bytes of the first character, but leave the remaining bytes as-is. One is inclined to ask what they were smoking when they came up with this scheme, but presumably they assumed that only swapping one byte instead of many would be better for performance. And, after all, performance is one of the more compelling reasons for using binary collations. However, in my performance tests, I have not been able to detect a difference between the two. To be fair, the design of the BIN collations is more than 20 years old, and possibly there was a discernible difference with the hardware that was current at the time.

As I said, the BIN collations are legacy collations, and the BIN2 collations, introduced in SQL 2005, are the better choice in my opinion. Then again, you could argue that since ordering by code-point values is quite useless for many languages, it does not really matter which of them you use.

Let's now look at a few situations where a binary collation can be a helpful tool. Let's say that we have this table with a list of words:

USE Latin1_General_CI_AS
go
CREATE TABLE morewords (word nvarchar (20) NOT NULL PRIMARY KEY)
INSERT morewords (word)
   VALUES('Abraham'), ('apple'), ('Berlin'), ('cigar'), ('London'),
         ('purple'), ('Simon'), ('Zagreb'), ('zuppa')

Occasionally, I see people who are trying to find the words that start in uppercase, and they are used to character ranges from regular expressions in Unix and elsewhere, so they try:

SELECT word 
FROM   morewords
WHERE  word LIKE '[A-Z]%'

But all words come back. This is because LIKE respects the collation which is case-insensitive. When people realise this, they try a case-sensitive collation:

SELECT word 
FROM   morewords
WHERE  word COLLATE Latin1_General_CS_AS LIKE '[A-Z]%'

But the output is the same. Well, almost: apple is no longer there. As we have learnt, in a case-sensitive collation, case only has a secondary weight, and the order is aAbBc... which is why the pattern excludes apple.

A binary collation saves the day:

SELECT word 
FROM   morewords
WHERE  word COLLATE Latin1_General_BIN2 LIKE '[A-Z]%'

The output:

Output of the query above. Only words starting in uppercase.

As you realise, this only works for English and a few more languages. For other languages, you have to add non-ASCII letters explicitly. For instance, for Swedish the range would have to be [A-ZÅÄÖ].
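
For instance, a Swedish version of the query might look like this (my own sketch; with a binary collation, letters outside A-Z have to be listed explicitly):

SELECT word 
FROM   morewords
WHERE  word COLLATE Finnish_Swedish_100_BIN2 LIKE '[A-ZÅÄÖ]%'

With the sample data above, the output is the same as before, since there are no words starting with Å, Ä or Ö.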

Another situation where binary collations can help you is when you run into problems with undefined code points. You may remember from the section The Version Number that when you search for an undefined code point with charindex or replace, SQL Server just gives up and does not find the character. For the examples in that section, we were able to get the expected results by using a version-100 collation. But the character you are working with may have been added in a much later version of Unicode. Or may not have been added at all, and will never be, because it is a control character.

Consider this snippet, where I purposely include a NUL character, U+0000, which I then try to get rid of:

USE Latin1_General_100_CI_AS
go
DECLARE @x varchar(10) = 'ABC' + char(0) + 'XYZ'
SELECT @x AS str, len(@x) AS strlen
SELECT @x = replace(@x, char(0), '')
SELECT @x AS str, len(@x) AS strlen

The output is certainly a little confusing:

Output of the batch above. We see ABC and 7. This is repeated.

SQL Server reports the string as being seven characters long, but it only displays three characters? No, that's the wrong conclusion. SQL Server never displays anything; it only sends data on the TDS channel. And it does send all seven characters. The culprit is SSMS, which is implemented in a language derived from C, and in C the NUL character serves as the string terminator, so the characters following the NUL character are lost in the output.
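
To convince yourself that all seven characters really are there, you can look at the bytes rather than the string. This is my own quick check, not part of the original script:

DECLARE @x varchar(10) = 'ABC' + char(0) + 'XYZ'
SELECT convert(varbinary(10), @x) AS bytes
-- Returns 0x4142430058595A, seven bytes with the NUL character in the middle.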

As you can see in the script, I try to remove the invisible character with replace, but that does not work out. This is because U+0000 is not a defined code point, and therefore replace has no rules to work from and gives up.

The solution is to cast to a binary collation, because in a binary collation all code points are equally good, and all characters are themselves and never equal to something else. Add these two lines to the script above:

SELECT @x = replace(@x COLLATE Latin1_General_BIN2, char(0), '')
SELECT @x AS str, len(@x) AS strlen

In addition to the two result sets above, we also get this:

Result set with two cells, ABCXYZ and 6

This time the NUL character was removed, thanks to the binary collation.

The last thing I'd like to discuss is the abundance of binary collations. When you look at the list of the collations, you find that all collation families have their own _BIN and _BIN2 collations. And if there are different versions for the same family, both versions have a pair of binary collations, so, for instance, there is Croatian_BIN, Croatian_BIN2, Croatian_100_BIN and Croatian_100_BIN2. Do all these binary collations really have unique properties?

No, they don't, but nor are all of them the same. While they are blind to language rules for comparing, sorting and grouping, there are still three properties that depend on the collation family or the version number: the code page used for varchar data, the set of code points that are defined (which follows from the version number), and the rules applied by upper and lower.

So, yes there are differences. But to take one example: all these collations are identical: Albanian_100_BIN2, Bosnian_Latin_100_BIN2, Croatian_100_BIN2, Czech_100_BIN2, Hungarian_100_BIN2, Hungarian_Technical_100_BIN2, Polish_100_BIN2, Romanian_100_BIN2, Serbian_Latin_100_BIN2, Slovak_100_BIN2, Slovenian_100_BIN2 and Turkmen_100_BIN2. That is, all version-100 BIN2 collations that use code page 1250. (With the reservation that there may be some upper/lower rules that I am unaware of.)
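
The code-page difference is easy to verify with collationproperty. This is my own quick check; the two collations below have the same binary comparison rules, but different code pages for varchar:

SELECT collationproperty('Latin1_General_100_BIN2', 'CodePage') AS Latin1,
       collationproperty('Polish_100_BIN2', 'CodePage')         AS Polish
-- Returns 1252 and 1250 respectively.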

SC – Surrogate Compatible

All collations we have worked with so far only support the Unicode base plane. In SQL 2012, Microsoft introduced the SC collations which extend the support to include the supplementary planes. SC can be read as supplementary characters, which seems to be the term used in the SQL Server documentation. Myself, I tend to use surrogate-compatible instead. The meaning of that term will be clear in a second.

SQL 2012 did not introduce any version-110 collations, instead Microsoft created SC collations for all non-binary version-90 and version-100 collations, but not the version-80 collations. Thus, there is Latin1_General_100_CI_AS_SC as well as Indic_General_90_CI_AS_SC, but there is no Latin1_General_CI_AS_SC.

As you see from the examples, a surrogate-compatible collation is marked by the part _SC appearing in the name. However, this is not true for the version-140 collations. They are all surrogate-compatible (save for the binary collations) without _SC appearing in the name, and Microsoft has stated that all future non-binary collations will be surrogate-compatible. However, they have been inconsistent with the naming. The version-160 collations are all surrogate-compatible, but they all have _SC in the name.

If you have a fondness for the old and deprecated data types text and ntext, you should know that they are not supported with SC collations.

So what is this talk of surrogates about? In UTF‑16, characters beyond the Unicode base plane are encoded in surrogate pairs, where the high word is in the range U+D800 to U+DBFF and the low word is in the range U+DC00 to U+DFFF. As you may guess, these code points do not define any characters of their own in Unicode.

We will work with something I will refer to as the Gothic string. This string consists of the first five letters in the ancient Gothic alphabet. (As you may know, all Gothic languages are extinct. Letters for extinct alphabets are typically found in the supplementary planes, although some, like runes, are in the base plane.) To spice things up, there is also an emoji. (Of the characters in the supplementary planes, the emojis are probably the ones that are the most widely used, at least among people who don't use ideographic characters.)

We will first look at this string converted to binary, so that we can see the bytes. (You find the scripts in this section in the file surrogates.sql in the download file collationscripts.zip.)

DECLARE @gothic nvarchar(20) = N'𐌰𐌱𐌲𐌳𐌴😕'
SELECT convert(binary(24), @gothic)
Picture of the query to the left with Gothic characters.

For the rest of the section, I don't include images of the code, as it is the same string that reappears.

This is the value returned:

0x00D830DF00D831DF00D832DF00D833DF00D834DF3DD815DE

To get a better understanding, let's split up in parts of four bytes each:

0x00D830DF
  00D831DF
  00D832DF
  00D833DF
  00D834DF
  3DD815DE

Recall that SQL Server and Wintel are little-endian, so we need to swap the bytes. Let's also add a separator between the 16‑bit words:

 0xD800 DF30
   D800 DF31
   D800 DF32
   D800 DF33
   D800 DF34
   D83D DE15

We can now see the surrogate pairs according to the description above.
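
As a worked example of my own (the arithmetic is the standard UTF‑16 decoding, nothing SQL Server-specific), this is how the first pair translates into a code point:

DECLARE @high int = 0xD800, @low int = 0xDF30
-- code point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
SELECT convert(binary(4), 65536 + (@high - 55296) * 1024 + (@low - 56320)) AS codepoint
-- Returns 0x00010330, that is U+10330, the first letter of the Gothic alphabet.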

Let's now look at how SQL Server handles this string in different collations, starting with a version-80 collation:

USE Latin1_General_CI_AS
go
DECLARE @gothic nvarchar(20) = N'𐌰𐌱𐌲𐌳𐌴😕'
SELECT @gothic AS word, charindex(N'😕', @gothic) AS charindex, 
       len(@gothic) AS charlen, datalength(@gothic) AS bytelen, 
       substring(@gothic, 1, 3) AS FirstThree

This is the output:

First part of output of the query above.

The word itself displays fine, because SSMS understands UTF‑16 in full, and the database collation does not affect how SSMS displays the data. The column bytelen also has the correct value, since datalength simply counts bytes. But the remaining columns return the wrong values. charindex returns 1, which indicates that with this collation, SQL Server does not understand the surrogate at all. That is, all these bytes are undefined code points. You may recall from the section about version numbers that if we search for an undefined code point, we usually get 0 back for no hit at all. But we get back 1, if the search string starts with an undefined code point as well, and this is why charindex reports 1 here.

The column charlen reports 12 characters, when there are only six. But once we know that the surrogates are not understood, this is not surprising. The len function simply counts byte pairs (excluding trailing spaces). The column FirstThree has the first A-looking character from the Gothic alphabet, but it is followed by a question mark in a rhombus. We learnt earlier that this is U+FFFD, REPLACEMENT CHARACTER, which we typically see when there is an encoding error. SSMS, which understands UTF‑16, realises that something is wrong when it gets only one half of a surrogate pair, and therefore it displays the replacement character. The reason is that with this collation, substring naïvely thinks that each character is two bytes long, and therefore returns one surrogate pair and a half.

Let's now run the same query with a version-90 collation:

CREATE DATABASE Kazakh_90_CI_AS
go
USE Kazakh_90_CI_AS
go
DECLARE @gothic nvarchar(20) = N'𐌰𐌱𐌲𐌳𐌴😕'
SELECT @gothic AS word, charindex(N'😕', @gothic) AS charindex, 
       len(@gothic) AS charlen, datalength(@gothic) AS bytelen, 
       substring(@gothic, 1, 3) AS FirstThree

There is one change in the output:

Output from the Gothic query in a database with the collation Kazakh_90_CI_AS

The column charindex now has the value 11. The value is of course incorrect, since there are only six characters. Still the change tells us something: with a version-90 collation, the code points for the surrogates are no longer considered undefined, but somehow, SQL Server understands the surrogates as characters, even though it does not understand that they should be paired together. This explains why Microsoft created SC collations both for version-90 and version-100 collations, but not for version-80 collations.

It is now time to try a surrogate-compatible collation. Which one does not matter, but I decided to go with one from my neighbourhood.

CREATE DATABASE Norwegian_100_CI_AS_SC COLLATE Norwegian_100_CI_AS_SC
go
USE Norwegian_100_CI_AS_SC
go
DECLARE @gothic nvarchar(20) = N'𐌰𐌱𐌲𐌳𐌴😕'
SELECT @gothic AS word, charindex(N'😕', @gothic) AS charindex, 
       len(@gothic) AS charlen, datalength(@gothic) AS bytelen, 
       substring(@gothic, 1, 3) AS FirstThree

This is the output

Output from the Gothic query in a database with a surrogate-compatible collation

This time both charindex and charlen return the correct number, 6. Furthermore, FirstThree now returns the first three real characters and not one and a half. The value of bytelen does not change, as that value is not affected by the collation.

Earlier in the article, I said that Unicode was limited to 20½ bits, which certainly seems kind of odd. The answer to this oddness lies with the surrogate pairs. Staying in the database Norwegian_100_CI_AS_SC, run this query, which combines the two highest possible surrogates:

SELECT convert(binary(4), unicode(nchar(0xDBFF) + nchar(0xDFFF)))

The query returns 0x0010FFFF, which is the highest possible code point in Unicode. (The unicode function returns the code point value for a character in decimal.)

Finally, let's look at what happens if we try to insert our Gothic string into an nvarchar(10) column. After all, it's only six characters, so that should work, or?

DECLARE @t TABLE (word nvarchar(10) NOT NULL)
INSERT @t (word) VALUES (N'𐌰𐌱𐌲𐌳𐌴😕')
Picture of the query to the left.

You may recall the Japanese example earlier when we tried to enter five ideographic characters into a varchar(8), and the result is the same here:

Error message "String or binary data would be truncated"

nvarchar(10) does not mean 10 characters, it means 10 byte-pairs. Like the Japanese example, it may not be entirely practical, but this is what the Gothic peoples have had to put up with since SQL 2012. Well, they would have had to put up with it, had any of them still existed.

UTF8 Collations

We have now arrived at the last component that can appear in the name of a Windows collation, _UTF8. The UTF‑8 collations were introduced in SQL 2019. Microsoft took all surrogate-compatible collations and created UTF‑8 collations from these. This has two implications: every UTF‑8 collation supports the supplementary planes, and there are no UTF‑8 collations in the version-80 families, since these never received any SC collations.

From a technical perspective, what signifies the UTF‑8 collations is that the code page for varchar does not follow from the collation family, but it is always 65001, that is, UTF‑8. Or in other words, with a UTF‑8 collation you have full Unicode support with both varchar and nvarchar.

To illustrate this, let's first look at what happens in a collation without UTF‑8 support. (You also find the scripts for this section in the file UTF8.sql in the file collationscripts.zip.)

USE Norwegian_100_CI_AS_SC
go
SELECT 'räksmörgås' AS shrimps,
       'słońce'     AS sun, 
       'αβγδ'       AS Greek,
       '中目黒'      AS Nakameguro,
       'Ｐａｒｉｓ'  AS Fullwidth,
       '𐌰𐌱𐌲𐌳𐌴😕'   AS Gothic
go
SELECT N'räksmörgås' AS shrimps,
       N'słońce'     AS sun, 
       N'αβγδ'       AS Greek,
       N'中目黒'      AS Nakameguro,
       N'Ｐａｒｉｓ'  AS Fullwidth,
       N'𐌰𐌱𐌲𐌳𐌴😕'   AS Gothic
The SELECT statement to the left. A SELECT of six words as varchar and nvarchar.

We have seen all these words before, but to repeat: räksmörgås is Swedish for shrimp sandwich, the second word is Polish for sun, the third string consists of the first four letters in the Greek alphabet and the fourth is the Tokyo district Nakameguro written in Kanji. The fifth is Paris written with fullwidth characters and last is the Gothic string which served as an example in the previous section. We select these strings both as varchar and nvarchar. This is the output:

Output from the two queries above.

When we use nvarchar, we get back all strings correctly, but with varchar, we get fallback characters for the characters not available in code page 1252, the code page for non-UTF8 Norwegian_100 collations. In passing, the display of the Gothic string is a little odd. There are 12 question marks, one for each surrogate, rather than one for each character.

Let's now create a database with a UTF‑8 collation, and let's also take the occasion to celebrate. You may recall that some collation families only support nvarchar for the older collations without UTF‑8 support, since they don't have a code page for varchar. And for this reason, you cannot use these collations as server or database collations. But now, with the advent of UTF‑8 collations in SQL 2019, the peoples of India, Laos, Malta and a few more countries can finally create databases with a collation for their languages!

CREATE DATABASE Indic_General_100_CI_AS_SC_UTF8 
        COLLATE Indic_General_100_CI_AS_SC_UTF8
go
USE Indic_General_100_CI_AS_SC_UTF8
go
SELECT 'räksmörgås' AS shrimps,
       'słońce'     AS sun, 
       'αβγδ'       AS Greek,
       '中目黒'      AS Nakameguro,
       'Ｐａｒｉｓ'  AS Fullwidth,
       '𐌰𐌱𐌲𐌳𐌴😕'   AS Gothic
go
SELECT N'räksmörgås' AS shrimps,
       N'słońce'     AS sun, 
       N'αβγδ'       AS Greek,
       N'中目黒'      AS Nakameguro,
       N'Ｐａｒｉｓ'  AS Fullwidth,
       N'𐌰𐌱𐌲𐌳𐌴😕'   AS Gothic
Picture of the query to the left.

Now we get the same output for varchar and nvarchar:

Output of the two most recent queries - the result sets are equal.

One reason that UTF‑8 is popular in some circles is that it is said to save space. Let's scrutinise this claim. This is not the place to give a full explanation of the UTF‑8 encoding, but here is an overview of the most crucial parts. UTF‑8 is a variable-length encoding with the following pattern: the ASCII characters (U+0000 to U+007F) take a single byte; code points up to U+07FF, which covers for instance the accented Latin characters as well as Greek, Cyrillic, Hebrew and Arabic, take two bytes; the rest of the base plane, including the East-Asian scripts, takes three bytes; and the characters in the supplementary planes take four bytes.

As you can see from this summary, the claim that UTF‑8 saves space is a very Western-centric one. You could say that when Microsoft opted to use UTF‑16 for Windows and nvarchar in SQL Server, they took a culture-neutral position.

Let's now create a table to which we add the words above and one more:

CREATE TABLE words (no   int NOT NULL, 
                    word varchar(50) NOT NULL)
INSERT words (no, word)
   VALUES(1, 'English'), 
         (2, 'Räksmörgås'), 
         (3, 'słońce'),
         (4, 'αβγδ'),
         (5, '中目黒'), 
         (6, 'Ｐａｒｉｓ'), 
         (7, '𐌰𐌱𐌲𐌳𐌴😕')
Picture of the query to the left.

We run this query:

SELECT no, word, len(word) as charlen, datalength(word) AS bytelen,
       substring(word, 1, 3) AS FirstThree
FROM   words
ORDER BY no

Which produces this output:

Output of the query above.

We can see that the columns charlen and FirstThree match what we see in the word column for all entries. As for bytelen, English is seven bytes, the same as the number of characters – it's all ASCII. Räksmörgås is 13 bytes – there are three non-ASCII characters in that word, but they are all in the Latin‑1 range. słońce is eight bytes, six plus two extra for the non-ASCII characters. The Greek letters require eight bytes; we are still in the range where characters take up two bytes. However, Nakameguro written in Kanji takes up nine bytes. You may recall that when we looked at the same name in varchar with the collation Japanese_CI_AS, it only required six bytes. Paris in fullwidth requires 15 bytes, because the fullwidth forms are close to the end of the Unicode base plane. Finally, the Gothic string requires 24 bytes, just like it does with nvarchar.

Let's now try this:

CREATE TABLE varchar6 (sun varchar(6) NOT NULL)
INSERT varchar6(sun)
    VALUES('słońce')

There are six characters, so it should work. Or? If you have been reading this article carefully, you may guess that there will not be a happy ending. Here is the error message:

Msg 2628, Level 16, State 1, Line 38

String or binary data would be truncated in table 'Indic_General_100_CI_AS_SC_UTF8.dbo.varchar6', column 'sun'. Truncated value: 'słoń'.

The statement has been terminated.

It fails, because varchar(6) means six bytes, not six characters. When some Western programmers see this, they go ballistic screaming What is Microsoft thinking! But as we have seen, this is nothing new. The Japanese, the Koreans and the Chinese have had to put up with this for decades now.

It can't be denied that it would be more practical if varchar(6) meant six characters and not six bytes, and this is also how it works, for instance, in Postgres. Oracle even goes as far as for some of their string types to let you choose whether you mean bytes or characters. I don't want to speculate how likely it is that Microsoft would implement a similar choice. However, with the current architecture there is a limitation of 8000 bytes per in-row column, so a varchar(n CHARACTERS) would need to cap n at 2000 for UTF‑8 collations, to allow for a string with 2000 emojis or other characters outside the Unicode base plane.

In any case, for the time being we will have to live with the fact that varchar(6) means six bytes, like it or not. Let's now look at some practical implications. Say that you want to permit 30 characters for a City column. If your business is mainly in the part of the world where the Latin script is used, you could take your chances with varchar(40), so that you can fit some non-ASCII characters. But if you need to support names in Greek or Cyrillic script, you would need varchar(60). And if you need to account for ancient Gothic cities, you will need varchar(120).

From one perspective, this does not have to be a big deal. A ten-byte string does not take up more space because it is in a varchar(120) column rather than in a varchar(30) column. Although, to be precise, there can be a small effect on optimisation, since when the optimiser estimates the memory needed for grants, it takes the max size of variable-length columns into account.

On the other hand, it could be the case that you really want to be sure that the string is restricted to 30 characters, for instance to make sure that it fits into the window of an envelope. You could work around that by adding a constraint:

City varchar(40) NOT NULL
   CONSTRAINT ck_City_30chars CHECK (len(City) <= 30)

Recall that len returns the length in characters, disregarding their size in bytes.

Let's say that you are about to design a new application, what to choose? varchar with a UTF‑8 collation or nvarchar with any collation? Certainly, this is a matter of preference, but my choice would be nvarchar, for a couple of reasons: with nvarchar, the declared length counts characters as long as you stay inside the base plane, which is easier to reason about than byte counts; and most client APIs, for instance in .NET, work with UTF‑16 strings anyway, so you avoid conversions between the database and the application.

But again: this is a matter of preference.

It's an entirely different matter if you have an existing application which uses varchar with a legacy code page. Imagine that you are working in a Dutch company with business in the neighbouring countries. "To save space", you have used varchar with the collation Latin1_General_CI_AS. Then one day your boss tells you: by the way, we have bought a business in Czechia, can you make sure that the system can handle Czech names properly? No, Czech names will not play nicely in that collation. If you were to solve this by switching to nvarchar, it would be a lot of work: To start with, you would have to alter lots of varchar columns to nvarchar, and those are size-of-data operations. You would also have to wade through the code to make sure that variables and temp-table columns use nvarchar where needed. String literals may have to be preceded by N. In all likelihood, you will need to make changes in the application code as well to reflect the change of data types. And on top of that, you need to test it all.

This is where a UTF‑8 collation can save the day, particularly if you can stay with your current lengths (because they are generous enough that the risk for truncation is small). Certainly, it is still not an easy ride; changing the database collation is always a major undertaking. If you need to implement this, I very much recommend that you read the accompanying article Changing the Server and Database Collation in SQL Server. I have included some tips that are specific to the situation where you need to switch to a UTF‑8 collation.

I will close this section with a few more titbits about UTF‑8 collations. There is one single binary collation for UTF‑8, Latin1_General_100_BIN2_UTF8. Since it is mainly the code page that distinguishes the other binary collations, it makes sense to only have one binary UTF‑8 collation. Well, almost. If you want a binary UTF‑8 collation with Turkish rules for upper- and lowercase, there is none.
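
You can verify this with a quick look in sys.fn_helpcollations; this query is my own:

SELECT name 
FROM   sys.fn_helpcollations()
WHERE  name LIKE '%UTF8' AND name LIKE '%BIN%'
-- Returns a single row: Latin1_General_100_BIN2_UTF8.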

The char function offers a trap with UTF‑8 collations. This returns a so-called n-dash:

SELECT nchar(8211)

So it would be logical if char(8211) also returned an n-dash in a UTF‑8 collation, but it doesn't. Instead, you get NULL back. This is true for all code points ≥ 128. This is nothing that is unique to UTF‑8 collations; it applies to all collations where the code page has multi-byte characters. This is due to the fact that the char function returns char(1), which only fits a single byte.
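
Here is a small check of my own that you can run both in the UTF‑8 database we created above and in a database with a single-byte collation:

SELECT char(233) AS char_result, nchar(233) AS nchar_result
-- With a code page such as 1252, both columns return é. With a UTF-8 collation,
-- char(233) returns NULL, since é takes two bytes in UTF-8, while nchar(233) still returns é.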

As I have mentioned, there are 32 version-160 collations in Azure SQL Database that are all for Chinese. Interestingly enough, all are UTF‑8 collations, even though UTF‑8 does not play well with Chinese. Then again, if the Chinese are wise, they stick with nvarchar when they use these new collations. What the reasoning is here, I don't know. But it could be that this is a new policy from Microsoft: in the future, when they add new collations, there will only be UTF‑8 collations, and no collations with legacy code pages. As long as there is no official word, it is difficult to tell, though.

The last observation is about collation conflicts, or rather the lack of them. If you take a pair of collations, where one is a UTF‑8 collation and the other collation has exactly the same properties, except that it uses a legacy code page for varchar, these collations are identical for nvarchar. Microsoft has observed this and made an exception. Consider:

CREATE TABLE mixture(a nvarchar(20) COLLATE Latin1_General_100_CI_AS_SC,
                     b nvarchar(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8)
SELECT * FROM mixture WHERE a = b

Normally, when you compare columns with different collations, you get a collation conflict and an error. But not in this case; this compiles and runs. Change the columns to varchar, and you will get back the notorious error Cannot resolve the collation conflict... (You could argue that since the varchar data is converted to UTF‑16 internally, there isn't really a conflict in that case either, but Microsoft has not taken it that far.)

Collation Families and Windows Locales

Earlier I quoted Books Online's explanation of Windows collations: Windows collations define rules for storing character data that's based on an associated Windows system locale. In this section, we shall try to get a better understanding of what this means with help of the function collationproperty.

We have already looked at this function to retrieve the code page of a collation, but the function accepts three more documented properties you can look up: Version, ComparisonStyle and LCID of which the last one is the main focus for this section. Let's just for completeness's sake cover the first two.

If you say

SELECT collationproperty('somecollation', 'Version')

you will get back 0 for a version-80 collation, 1 for version-90, 2 for version-100, 3 for version-140 and 4 for version-160.
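
For instance, this quick check of my own:

SELECT collationproperty('Latin1_General_CI_AS', 'Version')     AS v80,
       collationproperty('Latin1_General_100_CI_AS', 'Version') AS v100
-- Returns 0 and 2 respectively.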

ComparisonStyle is a bit mask. When 0, this means the collation is sensitive to all four of case, accent, kana and width. This value covers CS_AS_KS_WS collations as well as binary collations. When the lowest bit is set, this means that the collation is insensitive to case, so a value of 1 is a CI_AS_KS_WS collation. If the second-lowest bit is set, the collation is accent-insensitive. You may expect that the next bit is for kana-insensitivity, but not so fast. No, kana-insensitivity is marked by the 16th bit from the left and width insensitivity is the 15th from the left. Thus, for a CI_AS collation, the value is 1 + 65536 + 2*65536 = 196609.
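
We can check that the arithmetic adds up; this query is my own:

SELECT collationproperty('Latin1_General_CI_AS', 'ComparisonStyle')       AS CI_AS,
       collationproperty('Latin1_General_CS_AS_KS_WS', 'ComparisonStyle') AS CS_AS_KS_WS
-- Returns 196609 and 0 respectively.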

The careful reader may ask where variation-selector sensitivity (VSS) goes into this, and the answer is that this more recent form of sensitivity is not reflected in the value for ComparisonStyle.

Let's move to the LCID property. This property returns the Windows Locale ID for the collation. A Windows locale is a system of rules that applies to a certain language and culture. This includes rules for comparing and sorting, just like a collation. But it also includes rules for things of no concern for a collation in SQL Server like formatting of date, time, numbers and currency. Or in short, a Windows locale covers everything you find in your regional settings in the Control Panel. The LCID property returns a decimal number. These decimal numbers can be seen in several places in Windows.

But while the LCID is often presented as decimal, it is not a simple running number; it falls into several parts. The lower 10 bits denote the language, while the upper bits denote a variation of that language. (Those upper bits also fall into several parts, but we do not need to concern ourselves with that here.) For example, 1033 (0x0409) is English (United States), while 2057 (0x809) is English (United Kingdom). Many of these language variations are related to countries. Basically, for all countries where a language is spoken there is a local variation. But a variation can also be related to alternate sorting methods. As one example, you may recall that for German there is both dictionary order (äöü co-sort with aou) and phone-book order (they co-sort with ae/oe/ue), and there is an LCID for each. (Or, since German is spoken in several countries, there are likely to be multiple LCIDs for each.)
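
A quick illustration of my own, using the LCID of Latin1_General and a bit mask for the lower 10 bits:

SELECT collationproperty('Latin1_General_CI_AS', 'LCID') AS lcid,
       1033 & 1023 AS language_part_US,
       2057 & 1023 AS language_part_UK
-- The LCID is 1033, and both 1033 and 2057 have 9 (English) in the lower 10 bits.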

For a reference and a list of LCIDs, I've found one list with numbers in hex, which is from the formal specification of LCIDs. I've also found a list of LCIDs with decimal numbers. This is from a general specification for Office.

Let's now tie LCIDs and collations together. When the SQL Server team introduced the nvarchar data type, they could maybe have found a way to piece in the tens of thousands of new characters into the existing sort orders. But rather than reinventing the wheel, they apparently decided to make use of the work by the Windows team and work from the locales they had defined. In SQL 7, this connection was more overt in the sense that when you installed SQL Server you in fact had to select a locale for Unicode, not a collation. The SQL 7 setup then had checkboxes for the different types of sensitivity. But what you selected when you selected a locale was in fact a collation family. And starting with SQL 2000, we select a collation family, and there is not much talk about the locale.

But the locale is still there, and all collations within the same family have the same LCID. Conversely, collations from two different families have different LCIDs, but there are some exceptions. Or at least there are, if we think of a collation family as I have presented it so far, that is "the part that comes before the version number". There are some families that by that definition share an LCID, but a more accurate description is that these families were renamed when new version-90 or version-100 collations were introduced. This table lists all such cases:

Family name version-80          Family name version-100
Chinese_PRC                     Chinese_Simplified_Pinyin
Chinese_PRC_Stroke              Chinese_Simplified_Stroke_Order
Chinese_Taiwan_Bopomofo         Chinese_Traditional_Bopomofo
Chinese_Taiwan_Stroke           Chinese_Traditional_Stroke_Count
Danish_Norwegian                Danish_Greenlandic
Japanese                        Japanese_XJIS
Korean_Wansung                  Korean

E.g., Japanese_CI_AS and Japanese_XJIS_100_CI_AS both have the LCID 1041.
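
You can verify this with collationproperty; this is my own quick check:

SELECT collationproperty('Japanese_CI_AS', 'LCID')          AS version_80_family,
       collationproperty('Japanese_XJIS_100_CI_AS', 'LCID') AS version_100_family
-- Both return 1041.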

Thus, a proper definition of a collation family would be "all collations that have a certain LCID", but since the number of renamed families is small, I have permitted myself to mainly talk about the collation family as the part you see in the collation name.

There are many more LCIDs than there are collation families. There are several reasons why there is no collation family for a specific LCID: many LCIDs are only country variations of the same language and follow the same sorting rules, so they are covered by one and the same family; quite a few languages, not least English in all its variations, simply follow the rules of Latin1_General; and some locales have been added to Windows after the most recent batch of collations, so they have not (yet) been given a family of their own.

Finally, I was able to find two undocumented properties that collationproperty accepts. Name is not very startling, as it returns the name itself. The other is CollationID, which gives the internal id of the collation. There is little practical value in it, but if you run this query, you will see that there is a pattern in how they are assigned:

SELECT *, Collationproperty(name, 'CollationID')
FROM   sys.fn_helpcollations()
ORDER  BY Collationproperty(name, 'CollationID')

I can see the values from the ComparisonStyle property being reflected. But apart from that I have not dived into the fine details. That may be more than even I want to know about collations.

SQL Collations

We will now turn to the SQL collations. As I discussed earlier, they are legacy collations, and we will not dive as deep into these as we did into the Windows collations.

We will first take a general overview of the fairly small set of SQL collations. Next, we will look at some properties where the rules are different between varchar and nvarchar for SQL collations. We will also look at a collation feature which is unique to SQL collations: uppercase preference. The last section of this chapter takes us out into the undocumented land of compatibility collations.

General Observations about SQL Collations

To recap: For SQL collations there are different rule sets for varchar and nvarchar. For varchar, the SQL collations take their rules from the old sort orders in SQL 7 and earlier. For nvarchar, an SQL collation is equivalent to a Windows collation, and at the end of this section, I will give you a query that gives you the mapping.

To see all SQL collations, we can use this query:

SELECT name, description 
FROM   sys.fn_helpcollations() 
WHERE  name LIKE 'SQL%'
ORDER  BY name

If we first look in the column description, we see things like: Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data. That is, we can see the old sort-order id here, although that information is of little use today. This description, by the way, is for the collation SQL_Latin1_General_CP1_CI_AS.

Looking at the names, we can recognise them as consisting of a number of components, although it is not as systematic as with the Windows collations. We can discern these components.

The selection of collation families is worth some amusing comments. Most of the collation families are languages or Latin1_General. Interestingly enough, there are several collations with Latin1_General in the name where the code page is for a non-Latin script, but I have not investigated how these collations actually work.

Some collation families are more odd-shaped. The list is topped by SQL_1xCompat_CP850_CI_AS. May 1x refer to SQL Server 1.x? The mind boggles! There are some AltDiction collations and one MixDiction. To find out what this may be about, I started the VM where I have SQL 6.5 installed. To wit, with SQL 6.5 you got text files defining the sort orders. It seems that the main difference between AltDiction collations and "regular" collations lies in the secondary-level sorting of accented characters.

And then there are a couple of collations where the "collation family" is EBCDIC with a number, supposedly intended to provide compatibility with old mainframe systems. As a matter of fact, this is where the group of SQL collations may have some growth potential. Two SQL collations were added in SQL 2014, and they are both EBCDIC collations. Furthermore, some time back I had a discussion on the Microsoft Q&A forum with someone who claimed that a certain EBCDIC collation handled some punctuation characters incorrectly, and the outcome was that the poster would open a case with Azure Support to get a new EBCDIC collation. (That hasn't happened yet, though.) I also note that the EBCDIC collations are all case- and accent-sensitive.
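
If you are curious, you can list them for yourself and see that they all end in CS_AS:

SELECT name, description
FROM   sys.fn_helpcollations()
WHERE  name LIKE 'SQL[_]EBCDIC%'
ORDER  BY name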

When it comes to the collation families that are for specific languages or Latin1_General, most of them, but not all, have both a CI_AS and CS_AS collation. There is also a small number of SQL collations that are both case- and accent-insensitive.

It is also worth taking note of something that is not there: In the section on the history of collations, I mentioned that SQL 6.5 introduced sort orders for East-Asian languages. But for whatever reason, these sort orders did not become SQL collations in SQL 2000; all SQL collations are truly 8‑bit for varchar. We will see later in this chapter where the East-Asian sort orders are hiding.

I said previously that for SQL collations there are two sets of rules. There is one set of rules for varchar, which follows from the legacy sort order, and there is one set for nvarchar, which follows from the LCID of the collation. The nvarchar part is always the same as some Windows collation. I don't want to take up space by listing a full table for all 77 SQL collations, but here is a query to give you the mapping:

SELECT s.name, w.name
FROM  (SELECT * FROM sys.fn_helpcollations() WHERE name LIKE 'SQL[_]%') AS s
JOIN  (SELECT * FROM sys.fn_helpcollations() WHERE name NOT LIKE 'SQL[_]%') AS w ON
         collationproperty(s.name, 'Version') = collationproperty(w.name, 'Version')
     AND collationproperty(s.name, 'LCID') = collationproperty(w.name, 'LCID')
     AND (convert(int, collationproperty(s.name, 'ComparisonStyle')) <> 0 AND
          collationproperty(s.name, 'ComparisonStyle') = 
             collationproperty(w.name, 'ComparisonStyle') OR
         (convert(int, collationproperty(s.name, 'ComparisonStyle')) = 0 AND
          right(s.name, 4) = right(w.name, 4)))
ORDER BY s.name

That is, a pair of SQL and Windows collations match on the Version, LCID and ComparisonStyle properties from collationproperty with some extra logic to deal with that ComparisonStyle = 0 covers both binary collations and CS_AS_KS_WS collations. If you look at the result from the query, you will find that all Windows collations in the mapping are version-80 collations.

You may ask for a reference that documents what I just said. The answer is that I don't have any, but this is something I've derived from common sense and observations. If you want proof for a certain pair of collations, you can run a query like this one that compares the sorting between Latin1_General_CI_AS and SQL_Latin1_General_CP1_CI_AS:

WITH chars AS (
   SELECT nchar(value) AS ch, value
   FROM   generate_series(0, 65535)
), numbering AS (
   SELECT value, ch, 
          rank() OVER(ORDER BY ch COLLATE Latin1_General_CI_AS) AS Wind,
          rank() OVER(ORDER BY ch COLLATE SQL_Latin1_General_CP1_CI_AS) AS SQL
   FROM   chars
)
SELECT *
FROM   numbering 
WHERE  Wind <> SQL
ORDER BY value

It returns no rows, which means that rank produces exactly the same result for both collations. It is true, though, that this query does not reveal differences due to digraphs (such as cs sorting as a letter of its own in Hungarian) or differences in secondary-level sorting of accents.

And to repeat, this applies to the nvarchar part of the SQL collations. For the varchar part, they have their own rules and are typically not equivalent to a Windows collation.

How SQL Collations are Different

In this section we will look at how SQL collations produce different results for varchar and nvarchar. Since the most commonly used collation in the SQL Server world is SQL_Latin1_General_CP1_CI_AS, we will mainly work with this collation. Just because this collation is so commonly used does not mean that it is the best collation out there. To wit, as we shall see, there are some funny quirks. It has CI in the name, which suggests it is case-insensitive, but it turns out that this is not 100 percent true. Look at this:

CREATE DATABASE SQL_Latin1_General_CP1_CI_AS 
        COLLATE SQL_Latin1_General_CP1_CI_AS
go
USE SQL_Latin1_General_CP1_CI_AS
go
SELECT 'Košice' AS name,
       IIF( 'Košice' =  'KOŠICE', 'Equal', 'Not Equal') AS varchar,
       IIF( 'Košice' =  'KOšICE', 'Equal', 'Not Equal') AS varchar2,
       IIF(N'Košice' = N'KOŠICE', 'Equal', 'Not Equal') AS nvarchar
go

Košice is the second biggest city in Slovakia. This is the output:

Output from the query. It reads Košice, Not equal, equal, equal

So despite the collation being case-insensitive, Košice in mixed case is not equal to KOŠICE in all uppercase in varchar. The column varchar2 suggests that the crux of the biscuit is the letter š. That is, this letter is case-sensitive in this collation. To understand this, run this query:

SELECT ASCII('Š'), ASCII('š'), Unicode(N'Š'), Unicode(N'š')

The values returned are 138, 154, 352 and 353. This may seem a little puzzling, since this is a Latin‑1 collation, and for Latin‑1 the code-point values are supposed to be the same as in Unicode. But that applies only to the range 160-255, which is the range for printable characters in ISO-8859‑1; that standard leaves the range 128-159 to control characters. As I discussed earlier, some vendors could not resist the temptation to enhance Latin‑1 by adding extra characters in the range 128-159, and Microsoft decided that S with a háček (also known as a "caron") on top deserved a place in their code page 1252. (The choice is not entirely arbitrary. CP 1252 is for the Western world, to which Finland is counted, and the letter is used, if only marginally, in Finnish.)
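
If you want to see exactly which characters Microsoft squeezed into the range 128-159, you can run this little exploratory query in a database with a CP1252 collation, for instance the SQL_Latin1_General_CP1_CI_AS database we created above. (A few positions in this range are unused in CP 1252, so expect some odd-looking rows for those values.)

SELECT value AS cp1252_value,
       char(value) AS ch,                               -- the character in CP 1252
       unicode(convert(nvarchar(1), char(value))) AS unicode_value
FROM   generate_series(128, 159)
ORDER  BY value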

But why would that alone make the comparison of š case-sensitive? When I look in the file for sort order 52 that I found in my SQL 6.5 installation (see the previous section), I see that all characters in the range 128-159 are listed as undefined among all the punctuation characters. Thus, it seems that when this sort order was defined, Microsoft had not yet "enhanced" CP1252 with the extra characters. But why did Microsoft not change this later? This is something I covered already in the beginning of the article: collations are cast in stone, since changing a collation – or a sort order – would affect existing indexes.

However, far from all SQL collations for CP1252 have this behaviour. Here is a collation where Š and š are properly handled as letters:

CREATE DATABASE SQL_SwedishStd_Pref_CP1_CI_AS 
        COLLATE SQL_SwedishStd_Pref_CP1_CI_AS
go
USE SQL_SwedishStd_Pref_CP1_CI_AS
go
SELECT 'Košice' AS name,
       IIF( 'Košice' =  'KOŠICE', 'Equal', 'Not Equal') AS varchar,
       IIF( 'Košice' =  'KOšICE', 'Equal', 'Not Equal') AS varchar2,
       IIF(N'Košice' = N'KOŠICE', 'Equal', 'Not Equal') AS nvarchar

This outputs Equal for all three columns. Apparently, when this sort order was defined, CP1252 had been augmented in the range 128-159.

There are a few more letters that are case-sensitive for this reason in SQL_Latin1_General_CP1_CI_AS: Œ and œ, Ž and ž, Ÿ and ÿ.
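
You can check these letters the same way as we did with š. Here is a small sketch; if the reasoning above holds, all three columns should say Not Equal in this database:

USE SQL_Latin1_General_CP1_CI_AS
go
SELECT IIF('Œ' = 'œ', 'Equal', 'Not Equal') AS OE_ligature,
       IIF('Ž' = 'ž', 'Equal', 'Not Equal') AS Z_caron,
       IIF('Ÿ' = 'ÿ', 'Equal', 'Not Equal') AS Y_diaeresis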

Let's also look at sorting with SQL collations. We create a table with characters to sort in the database SQL_Latin1_General_CP1_CI_AS. The table has both a char(1) and an nchar(1) column, and we store the same characters in both columns:

USE SQL_Latin1_General_CP1_CI_AS
go
CREATE TABLE sortchars(no  int      NOT NULL,
                       ch  char(1)  NOT NULL,
                       nch nchar(1) NOT NULL,
                       CONSTRAINT pk_sortchars PRIMARY KEY (no)
)
go
INSERT sortchars(no, ch, nch)
   VALUES(1,  ' ', N' '),
         (2,  '-', N'-'),
         (3,  '@', N'@'),
         (4,  '+', N'+'),
         (5,  '€', N'€'),
         (6,  '1', N'1'),
         (7,  'A', N'A'),
         (8,  'Š', N'Š'),
         (9,  'š', N'š'),
         (10, 'Z', N'Z')

Then we sort it by the nch column:

SELECT no, ch, nch
FROM   sortchars
ORDER  BY nch 

The output follows the no column.

Output from the SELECT query. Order follows the column no.

We change the ORDER BY clause to read:

ORDER BY nch COLLATE Latin1_General_CI_AS

The output is the same. As I showed above, the nvarchar part of SQL_Latin1_General_CP1_CI_AS is equivalent to Latin1_General_CI_AS.

Let's now sort by the ch column while keeping the COLLATE clause:

ORDER BY ch COLLATE Latin1_General_CI_AS

The order is still the same, which is not very surprising. We are sorting by a Windows collation, which has the same rules for varchar and nvarchar.

But if we now remove the COLLATE clause, so that we sort by the database collation, SQL_Latin1_General_CP1_CI_AS, things change:

SELECT no, ch, nch
FROM   sortchars 
ORDER  BY ch

This is the output:

Output when we sort by the ch column.

The plus character now jumps up and sorts before the minus character. This is not overly remarkable, as there is no given order for punctuation characters. Rather, it is an arbitrary choice for how to sort them. When the original sort orders were defined, there was one choice (which appears to have been to sort punctuation characters by their code points). When Microsoft designed the Windows locales which are the base for the Windows collations, they made a different choice, possibly drawn from something generally accepted in the Unicode world. What is more remarkable in the output is that Š and š jump up and sort before the digit. But we already know why: as far as this collation is concerned, these two characters are not alphabetic characters, but undefined punctuation characters.

Let's also try with a more "modern" SQL collation:

SELECT no, ch, nch
FROM   sortchars 
ORDER  BY ch COLLATE SQL_SwedishStd_Pref_CP1_CI_AS

We get this output:

Output when sorting by  SQL_SwedishStd_Pref_CP1_CI_AS

The plus character sorts before the minus and the @ sign with this collation as well, but Š and š now sort between A and Z as expected.

Lest you think that all SQL collations sort punctuation characters in the same way, let me also show you this:

SELECT no, ch, nch
FROM   sortchars 
ORDER  BY ch COLLATE SQL_EBCDIC278_CP1_CS_AS

The output looks like nothing we have seen before:

The sortchars table sorted by an EBCDIC collation

Please don't ask what is going on, but I guess it is related to EBCDIC.

The last thing we will do in this section is to look at case-sensitive sorting. Previously, we worked with this table:

USE Latin1_General_100_CI_AS
go
DROP TABLE IF EXISTS CaseSorting
CREATE TABLE CaseSorting (no   int         NOT NULL PRIMARY KEY,
                          word varchar(30) NOT NULL, 
                          data int         NOT NULL)
INSERT CaseSorting (no, word, data)
   VALUES ( 1, 'Andrew',      9), 
          ( 2, 'ascertain',  18), 
          ( 3, 'Australia',  76),
          ( 4, 'Mixed+Case', 99),
          ( 5, 'MixedCandy', 75),
          ( 6, 'Mixedcase',   6), 
          ( 7, 'MIXEDCASE',  13), 
          ( 8, 'mixedcase',  37), 
          ( 9, 'MixedCase',  58),
          (10, 'Mixed-Case', 46)

We sorted this table with a case-sensitive Windows collation:

SELECT no, word, data
FROM   CaseSorting
ORDER  BY word COLLATE Latin1_General_100_CS_AS, data

Which gave us this result:

Data in CaseSorting sorted with a case-sensitive Windows collation. Lowercase comes before uppercase.

That is, case has a secondary weight, as has the hyphen, and lowercase sorts before uppercase. With an SQL collation, it is different:

SELECT no, word, data
FROM   CaseSorting
ORDER  BY word COLLATE SQL_Latin1_General_CP1_CS_AS

Here is the output:

The CaseSorting table sorted by a case-sensitive SQL collation

The concept of secondary weight for case exists for SQL collations as well, but they sort uppercase before lowercase. And they don't have that extra twist for the hyphen.

Uppercase Preference and Tertiary Weights

We will now look at a collation property that only appears with the varchar part of SQL collations, to wit, uppercase preference. I think the best way to explain what it is all about is by showing an example.

We will work with this table:

DROP TABLE IF EXISTS PrefSorting
CREATE TABLE PrefSorting (guid uniqueidentifier NOT NULL PRIMARY KEY DEFAULT newid(),
                          word varchar(30)      NOT NULL, 
                          data int              NOT NULL)

INSERT PrefSorting (word, data)
  VALUES ('Mixedcase',  13), 
         ('MIXEDCASE',  13), 
         ('mixedcase',  13), 
         ('MixedCase',  13),
         ('Mixedcase',  55), 
         ('MIXEDCASE',  55), 
         ('mixedcase',  55), 
         ('MixedCase',  55),
         ('Mixedcase',  89), 
         ('MIXEDCASE',  89), 
         ('mixedcase',  89), 
         ('MixedCase',  89)

First, we sort the data with the common collation SQL_Latin1_General_CP1_CI_AS, which does not have uppercase preference:

SELECT word, data
FROM   PrefSorting
ORDER  BY word COLLATE SQL_Latin1_General_CP1_CI_AS, data

The output may look like this:

The data in PrefSorting sorted according to the previous query.

The exact order is random, which you can see if you re-run the loading script and the SELECT query a few times. The values in the data column always come in the groups 13, 55 and 89 in that order. But within those groups, the variations of MixedCase come in no particular order. This is not very puzzling. We are sorting by a case-insensitive collation, so the result is non-deterministic.

See now what happens when we sort by a collation with Pref in the name:

SELECT word, data
FROM   PrefSorting
ORDER  BY word COLLATE SQL_SwedishStd_Pref_CP1_CI_AS, data

Now we get:

Output from the query with ORDER BY SQL_SwedishStd_Pref_CP1_CI_AS.

When we re-run, we get the same result every time, so it seems to be deterministic. In a collation with uppercase preference, case has a tertiary weight, meaning that case only matters if two rows have the same values for all columns in the ORDER BY clause. So in this query, the sorting is by:

  1. The value of word, disregarding case. (Never mind that all rows have the same word in this example.)
  2. The value in data.
  3. The case of word, with uppercase coming before lowercase.

This third level of sorting results in a complication when it comes to indexes. Consider this variation, where we specify the collation in the table definition and add an index on word and data:

DROP TABLE IF EXISTS PrefSorting
CREATE TABLE PrefSorting (
       guid uniqueidentifier NOT NULL PRIMARY KEY DEFAULT newid(),
       word varchar(30) COLLATE SQL_SwedishStd_Pref_CP1_CI_AS NOT NULL, 
       data int         NOT NULL)

INSERT PrefSorting (word, data)
  VALUES ('Mixedcase',  13), 
         ('MIXEDCASE',  13), 
         ('mixedcase',  13), 
         ('MixedCase',  13),
         ('Mixedcase',  55), 
         ('MIXEDCASE',  55), 
         ('mixedcase',  55), 
         ('MixedCase',  55),
         ('Mixedcase',  89), 
         ('MIXEDCASE',  89), 
         ('mixedcase',  89), 
         ('MixedCase',  89)

CREATE INDEX word_data_ix ON PrefSorting(word, data)

SELECT word, data
FROM   PrefSorting
ORDER  BY word, data

Normally, when you have a query where you order by an index, the plan is an Index Scan without a Sort operator. But in this case, the execution plan looks like this:

Query plan with SELECT - Sort - Compute Scalar - Index Scan

To be able to perform the uppercase-preference sorting, the optimizer adds a Compute Scalar operator and a Sort operator to the plan. Thus, using a collation with uppercase preference can come with a performance penalty due to extra sorting. However, Microsoft has been ambitious enough to provide a mitigation.

If you check out the properties for the Compute Scalar operator, you will find that it computes this expression:

Scalar Operator(tertiary_weights([tempdb].[dbo].[PrefSorting].[word]))

This function, tertiary_weights, is nothing secret, but it is described in the SQL Server documentation. The topic for the function suggests that you add a computed column which you can then include in your indexes. For instance:

DROP TABLE IF EXISTS PrefSorting
CREATE TABLE PrefSorting (
       guid uniqueidentifier NOT NULL PRIMARY KEY DEFAULT newid(),
       word varchar(30) COLLATE SQL_SwedishStd_Pref_CP1_CI_AS NOT NULL, 
       data int         NOT NULL,
       tertiary AS tertiary_weights(word))

INSERT PrefSorting (word, data)
  VALUES ('Mixedcase',  13), 
         ('MIXEDCASE',  13), 
         ('mixedcase',  13), 
         ('MixedCase',  13),
         ('Mixedcase',  55), 
         ('MIXEDCASE',  55), 
         ('mixedcase',  55), 
         ('MixedCase',  55),
         ('Mixedcase',  89), 
         ('MIXEDCASE',  89), 
         ('mixedcase',  89), 
         ('MixedCase',  89)

CREATE INDEX word_data_ix ON PrefSorting(word, data, tertiary)

SELECT word, data
FROM   PrefSorting
ORDER  BY word, data

The Sort operator is no longer there.

So far, I have suggested that you can tell whether a collation has uppercase preference from the presence of the Pref component in the collation name. However, there are quite a few SQL collations that have uppercase preference, despite Pref not appearing in the name. For a full list of collations with uppercase preference see the topic for tertiary_weights (where they are referred to as SQL tertiary collations).

Final reminder: Uppercase preference only applies to varchar, not to nvarchar. Recall that with nvarchar, an SQL collation is just a Windows collation in disguise. There are no case-insensitive Windows collations with uppercase preference, a feature I personally find to be of dubious value.

Compatibility Collations

We will now look into a very obscure and entirely undocumented corner of SQL Server. Readers who feel that they don't want to know exactly everything about collations have all reason to skip this section.

You may recall that earlier in the article, I had a query to find 172 deprecated Windows collations. You may also recall that I stopped at 220, because going from 219 to 220 did not add any new collations.

But then I recalled that an MVP colleague on some occasion had mentioned something about a very old database that ran with a "compatibility collation". Hm, I thought, could there be more collations I did not know of? Eventually, curiosity killed the cat, and on Azure SQL Managed Instance, I ran this query to test all 2³² possible collation ids:

DROP TABLE IF EXISTS collations
CREATE TABLE collations (name sysname NOT NULL PRIMARY KEY)
go
WITH all_collation_names(name) AS (
   SELECT DISTINCT convert(nvarchar(128), collationpropertyfromid(value, 'Name')) AS name
   FROM   generate_series(convert(int, 0x80000000), convert(int, 0x7FFFFFFF), 1)
)
INSERT collations(name)
SELECT name
FROM   all_collation_names
WHERE  collationproperty(name, 'Codepage') IS NOT NULL

To repeat what I said when I presented the original query: Just looking up a name from the id is not enough, because collationpropertyfromid can return non-existing collations. Whence the filter on collationproperty to verify that the name returned is an existing collation. The DISTINCT in the CTE is needed, because multiple ID values can return the same name. I mention this, because in this section we will see some more variations.

As you might guess, the query ran for several minutes. When it finally had completed, I almost fell out of my chair. I was looking at this message:

(90493 rows affected)

So in addition to the 5540 official collations and the 173 deprecated, there are almost 85 000 more hiding!

In order to analyse this vast quantity, I ran a query where I extracted the part before the first underscore in the collation name, filtering out the official collations:

WITH firstparts AS (
   SELECT left(c.name, charindex('_', c.name) - 1) AS firstpart
   FROM   collations c
   WHERE  c.name NOT IN (SELECT f.name FROM sys.fn_helpcollations() f)
)
SELECT firstpart, COUNT(*)
FROM   firstparts
GROUP  BY firstpart
ORDER  BY firstpart

This is the result set:

firstpart Count
Azeri 100
Compatibility 84 780
Hindi 18
Korean 18
Lithuanian 18
Macedonian 18
SQL 1

The rows with the "small" numbers are the 173 deprecated collations I have already discussed. The rest are those compatibility collations that my MVP colleague had touched on. To get some idea of these, I took a sample:

SELECT TOP 5 name 
FROM   collations
WHERE  name LIKE 'Compatibility%'
ORDER  BY newid()

I saw names like these examples: Compatibility_112_418_2, Compatibility_96_409_20003, Compatibility_97_415_30001, Compatibility_83_409_10001, Compatibility_112_c1a_30002. At first, I had no idea what those numbers in the names might mean. I went through a number of iterations where I rewrote this section several times as I learnt more and more. Here are the findings I eventually arrived at.

First of all, compatibility with what? Collations as we know them were introduced in SQL 2000, so this is compatibility with SQL 7. You may recall from the section The History of Collations in SQL Server that things were a little funny in SQL 7. To recap: When you installed SQL 7, you selected a sort order for varchar and a Unicode collation (as the term was) for nvarchar, and those were two separate choices. While the Setup UI would nudge you to select a sort order and a collation with the same basic rules, there was nothing to stop you from having a case-sensitive sort order for Hungarian for varchar, and a case-insensitive collation for Chinese for nvarchar.

When Microsoft introduced the unified Windows collations in SQL 2000, they took 74 coherent combinations of sort orders and Unicode collations from SQL 7 and made them into SQL collations. The rest they hid as these undocumented compatibility collations. Since there was no way for Microsoft to tell which combinations were in use, they had to provide collations for all of them – even if many combinations were highly unlikely. (But certainly not all. As I noted previously, they did not define any SQL collations for East-Asian languages, so a database with a case-insensitive Japanese sort order for varchar and a case-insensitive Unicode collation for Japanese that was upgraded from SQL 7 to SQL 2000 was bestowed with a compatibility collation.)

As I mentioned, at first I had no idea what these numbers in the collation names might be, but eventually I figured out what the numbers are:

  1. The first number is the sort order for varchar. This is a decimal number. There are 108 different values appearing in the names with 785 collations per sort order. (108 * 785 = 84 780.)
  2. The second number is the LCID for the nvarchar part of the collation in hex. In total there are 50 different LCIDs appearing in the collation names, with 1728 (= 108 * 16) collations each, with the exception of LCID 8200 which appears only with 108 collations. (49 * 1728 + 108 = 84 780.)
  3. The third is a bit mask in hex that represents the comparison style for the nvarchar part. There are 16 different values, with 5292 (= 49 * 108) collations each, except for style 0 which appears with 5400 collations, that is, 108 more than the rest. (5292 * 16 + 108 = 84 780.)

In order to analyse the collations and to verify my conclusions, I set up this view, where I used the function collationproperty to get data about the collations:

CREATE OR ALTER VIEW CompatCollations AS
   SELECT name, 
          sortorder     = convert(int, parsename(replace(name, '_', '.'), 3)),
          LCIDName      = upper(parsename(replace(name, '_', '.'), 2)),
          CompStyleName = upper(parsename(replace(name, '_', '.'), 1)),
          Version       = convert(int, collationproperty(name, 'Version')),
          CodePage      = convert(int, collationproperty(name, 'CodePage')),
          LCIDProp      = isnull(
                            nullif(
                             trim('.' FROM trim('0' FROM 
                             convert(char(8), 
                               convert(binary(4), 
                                 convert(int, 
                                    collationproperty(name, 'LCID')
                                 )), 2) + '.')),
                             ''), 
                          '0'),
          CompStyleProp = isnull(nullif(
                             trim('.' FROM trim('0' FROM 
                             convert(char(8), 
                               convert(binary(4), 
                                 convert(int, 
                                    collationproperty(name, 'ComparisonStyle')
                                 )), 2) + '.')),
                              ''),
                          '0'),
          CollationID = convert(int, collationproperty(name, 'CollationID'))
   FROM   collations
   WHERE  name LIKE 'Compatibility%'

The purpose of the unbelievable function nesting for the Prop columns is to convert the decimal numbers to numbers in hex without any leading zero (but retaining 0 if the value is 0) so that the numbers have the same format as the hex numbers in the names.
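
To see the mechanics of that nesting in isolation, here is the same expression broken out from the view and applied to two literal values, the LCID 1041 (hex 411) and the value 0. This is only an illustration:

SELECT isnull(nullif(
          trim('.' FROM trim('0' FROM
             convert(char(8), convert(binary(4), 1041), 2) + '.')),
          ''), '0') AS hex_1041,     -- gives 411
       isnull(nullif(
          trim('.' FROM trim('0' FROM
             convert(char(8), convert(binary(4), 0), 2) + '.')),
          ''), '0') AS hex_zero      -- gives 0

The appended full stop protects significant trailing zeroes in the hex string from being trimmed away, and the outer trim then removes the full stop again.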

First, a quick word about the Version column. As expected, it is 0 for all these collations. That is, they are all version-80 collations, just like the official SQL collations.
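
That is easily verified directly against the view:

SELECT DISTINCT Version
FROM   CompatCollations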

I will now cover the components in the collation names in more detail to prove my theory, and this will also reveal some odd things. I'm starting at the back with the comparison style, as this is an easy and regular one. For all collations, the hex string in the name matches the value returned by collationproperty. As for the 108 extra collations with style = 0, you may recall that this value is reported both for CS_AS_KS_WS collations and binary collations, so nothing strange there.

Next, we move to the LCID. This is not equally clean, because there are a considerable number of collations where the LCID part of the name does not match the LCID returned by collationproperty. These mismatches fall into two groups. The first group consists of those 108 collations where the LCID part in the name is 8200 that I mentioned above. 8200 is not a "real" LCID, since there is no language 0. This LCID always appears together with comparison style 0, so there is all reason to assume that the nvarchar part of these collations is binary, and a quick test seems to confirm this. (And that would be a binary collation of BIN style, not BIN2, as the latter did not appear until SQL 2005.)

The other group consists of 1728 collation names each for seven LCIDs, which you could say are alternate names for the collations with the LCID reported by collationproperty. That is, these are pairs of collation names that have the same collation ID. To take one example, LCID = 0x0429 is Persian, but the collations with _429_ are not Persian collations; the actual LCID of these collations is 0x0401 (Arabic). Here is a quick example to demonstrate.

DECLARE @t TABLE (a varchar(20) COLLATE Compatibility_52_429_1, 
                  b varchar(20) COLLATE Compatibility_52_401_1)
SELECT * FROM @t WHERE a = b

This runs without error despite the different collation names. Furthermore, the column CollationID in the view has the same value for these two collations. These "extra" LCIDs in the names are LCIDs for which collations first appeared in SQL 2008 with the version-100 collations, and there were no collations related to these LCIDs in SQL 2000. For some reason, they nevertheless appeared prematurely with the compatibility collations already in SQL 2000.
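
You can also see the shared id directly with the undocumented CollationID property discussed earlier; both calls should return the same number:

SELECT collationproperty('Compatibility_52_429_1', 'CollationID') AS id_429,
       collationproperty('Compatibility_52_401_1', 'CollationID') AS id_401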

What remains to discuss of the components in the names of the compatibility collations is the first one, the sort order. As I have discussed earlier, up to SQL Server 6.5 there were only sort orders, and these sort orders all had a number. A sort order is a wholesale package with code page, language rules, case/accent sensitivity and so on. There is no particular pattern with these numbers, although sort orders related to the same code page tend to have ids close to each other.

It would have helped to confirm the theory if collationproperty had accepted an argument to return the sort order, but I have not been able to find any. However, it is possible to retrieve the sort order for a collation this way:

CREATE DATABASE Compatibility_193_40b_2 COLLATE Compatibility_193_40b_2
SELECT databasepropertyex('Compatibility_193_40b_2', 'SQLSortOrder')

Note: This CREATE DATABASE statement does not run in Azure SQL Database or Azure SQL Managed instance. If you try to create a database with a compatibility collation in Azure SQL, you get this error message:

Msg 40665, Level 16, State 1, Line 1

'Compatibility_193_40b_2' is not a supported collation.

You can still use compatibility collations on column or expression level in Azure SQL.

The value returned by databasepropertyex is indeed 193. I did not feel like creating 85 000 databases to verify, though. Instead, I decided that it would be sufficient if I could verify that the code page reported by collationproperty matched the code page for the sort order from some other source. It occurred to me that I could use the compatibility view sys.syscharsets. In this view, the rows with type = 2001 are the sort orders. For these rows, the column csid points to a row with type = 1001, which specifies the code page, which is stored as a string like "cp1251", except for code page 1252, where the string is "iso_1".
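
To get a feel for what sys.syscharsets holds, you can peek at the sort-order and character-set rows like this (only the columns used below are included):

SELECT type, id, csid, name, description
FROM   sys.syscharsets
WHERE  type IN (1001, 2001)
ORDER  BY type, id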

From this I composed this query:

SELECT * INTO #syscharsets FROM sys.syscharsets

SELECT *
FROM   CompatCollations CC
JOIN   #syscharsets cs1 ON cs1.id = CC.sortorder
                       AND cs1.type = 2001
JOIN   #syscharsets cs2 ON cs2.id = cs1.csid
                       AND cs2.type = 1001
WHERE  NOT (CC.CodePage <> 1252 AND concat('cp', CC.CodePage) = cs2.name OR
            CC.CodePage = 1252 AND cs2.name = 'iso_1')

Note: The reason I copy the view into a temp table is that some other queries I ran against this view produced incorrect results, a problem that went away when I introduced the temp table.

The idea here is that the WHERE clause filters out all rows where the code page reported by collationproperty matches the code page in sys.syscharsets. The query returns no rows, which is a strong indication for my theory. A slight complication, though, is that if I remove the NOT, the query returns 78 500 collations, and not 84 780 as you may expect. There are eight sort orders in CompatCollations that are absent from sys.syscharsets. When I checked against the SQL 7 documentation, these sort orders are missing there as well. However, interestingly enough, five of these eight sort orders do appear in syscharsets on SQL 7, and with the code page matching what I have in CompatCollations. My assumption is that these sort orders were deprecated already in the SQL 7 time frame. So despite this small hiccup, I am very confident that the first component in the names of the compatibility collations is the sort order.

An interesting observation is that none of the EBCDIC sort orders are listed in the SQL 7 documentation, nor do they appear in syscharsets on SQL 7. Thus, it seems that they were added in SQL 2000. Nevertheless, there are compatibility collations for these sort orders as well. And not only that. As I mentioned earlier, two new SQL collations for EBCDIC were added in SQL 2014. These two collations also introduced two new sort orders, 218 and 219. While there is no need for compatibility with SQL 7 for these, there are still 785 compatibility collations each for these sort orders. And here is a really interesting fact: these 1570 collations were available already in SQL 2012! Some kind of public preview?

Here is another remarkable oddity: It turns out that some compatibility collations are in fact – Windows collations! I got the idea that the upper 8 bits of the collation id are the sort order, and I ran a query to verify this. I found that it matched in most cases, but not all. This led me to this query:

SELECT name, CollationID,
       collationpropertyfromId(CollationID, 'Name') AS RealName
FROM   CompatCollations
WHERE  sortorder <> right_shift(CollationID, 24)

This query returns 33 rows. That's too many to list here, but as an example, this runs without a collation conflict:

DECLARE @t TABLE (a varchar(20) COLLATE Compatibility_72_409_30001, 
                  b varchar(20) COLLATE Latin1_General_CI_AS)
SELECT * FROM @t WHERE a = b

What the story is here, I don't know, but I find that all the compatibility collations returned by the query above are missing on SQL 2000. That might have been due to an imperfection of some sort. When they tried to repair this omission in SQL 2005, something else went wrong, it seems. (When I look at how many collations there are by comparison style in SQL 2000, I get the impression that the number of missing collations is greater than 33. So it seems that some collations were restored successfully in SQL 2005. But not all.)

One final observation about the SQL collations. As we have seen, the number of the compatibility collations is the product of the possible values of all properties. And an SQL collation is exactly such a combination of properties that has deserved an "official" stamp. As you might guess, of the 84 780 collation names, some of them share collation id with an SQL collation. This query:

WITH CTE AS (
   SELECT name AS CompatName, 
          convert(sysname, collationpropertyfromid(CollationID, 'Name')) AS SQLName
   FROM   CompatCollations
)
SELECT CompatName, SQLName
FROM   CTE
WHERE  SQLName LIKE 'SQL%'
ORDER  BY SQLName

returns 79 rows, one more than the actual number of SQL collations. This is because there are a few SQL collations that have two alternate names among the compatibility collations due to the LCID issue I discussed earlier. On the other hand, the two SQL collations that are BIN2 collations do not share collation id with any compatibility collation.
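
In case you wonder which two collations those are, this query lists them:

SELECT name
FROM   sys.fn_helpcollations()
WHERE  name LIKE 'SQL%BIN2'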

Now that we have seen that there are duplicate names for some collation ids, we may ask: what is the actual number of collations in SQL Server? Going back to the collations table I created initially in this section, I ran this query:

SELECT COUNT(*), COUNT(DISTINCT collationproperty(name, 'CollationID')) 
FROM   collations

The initial number of 90 493 turns out to be "only" 78 291.

Before we close this chapter, let's look at an example of how crazy these collations can be. We have already created the database Compatibility_193_40b_2. 193 is a Japanese sort order, and it is akin to the collation Japanese_CI_AS, if not identical. 0x040b is the LCID for Finnish, so the nvarchar part of this collation is the same as Finnish_Swedish_CS_AI_KS_WS. Thus, the varchar and nvarchar parts have completely different sensitivity rules. (The batch below can also be found in the file compatcollations.sql in the zip archive collationscripts.zip.)

USE Compatibility_193_40b_2
go
CREATE TABLE crazy(
    a varchar(20) NOT NULL,
    b varchar(20) NOT NULL,
    "equi-varchar" AS IIF(a = b, 'Yes', 'No'),
    "equi-nvarchar" AS IIF(CAST(a AS nvarchar) = CAST(b AS nvarchar), 'Yes', 'No')
)
INSERT crazy(a, b)
   VALUES('insert', 'INSERT'),
         ('なかめぐろ', 'なかめくろ'),
         ('なかめぐろ', 'ナカメグロ'),  
         ('Paris', 'Paris')
go
SELECT * FROM crazy
SELECT IIF ('Virginia' < 'Washington', 'Yes', 'No') AS varchar,
       IIF (N'Virginia' < N'Washington', 'Yes', 'No') AS nvarchar

Image of the INSERT statement above

We have seen all words before, but to repeat: The second row has Nakameguro in hiragana to the left and the mis-spelling Nakamekuro to the right; the difference lies in the absence of the dakuten accent on the ku syllable. The third row has Nakameguro in hiragana and katakana.

This is the output:

Output from the queries above. All rows have one yes and one no.

As for the last result set: Recall that in Finnish-Swedish collations, W is just a fancy "accented" form of V and they co-sort.

One will have to hope that there is no production database with this collation anywhere. I would not be surprised if I was the first one ever to use this particular collation.

Permit me to close this section by pointing out the obvious: if you are already running a database with one of these compatibility collations, then, given how difficult it is to change the database collation, you may prefer to continue to use that collation. But it goes without saying that if you are creating a new database, there is absolutely no reason to use any of these arcane collations. They could be good for pranks, though.

Some Unicode Modifier Characters

Unicode includes quite a number of characters that make no sense on their own, but the task of these characters is to modify the preceding character, or to control how two characters are combined. Many of these characters behave the same in all non-binary collations, and as this is an article on collations and not on Unicode in general, we will constrain ourselves to two groups of such characters: Combining diacritical marks, since this is a feature that can catch you off-guard if you encounter them, and variation selectors which are subject to different treatment in the VSS collations.

Combining Diacritical Marks

When the Unicode standard was defined, the original aim was to make it a 16‑bit standard, and there was an anxiety that having a code point for each combination of a base character and a diacritical mark would result in outrunning the 16‑bit range. Therefore, they came up with the idea to have non-spacing diacritical marks that would be combined with the preceding character. That is, rather than having a single code point for the character é, it would be expressed as e + combining acute accent.

In the end, the Unicode consortium settled on having round-trip compatibility with legacy character sets, and for this to be possible, they had to include code points for many precomposed characters. However, the combining diacritical marks also made it to the Unicode standard, and therefore a character like é can be expressed in two ways in Unicode.

Note: On some keyboard layouts, there are mute accent keys. This includes the Swedish layout. To get an e with acute accent, I first need to press the key for acute accent and then the e. While this is reminiscent of combining diacritical marks, the result is a precomposed é, not e + combining acute accent.

The scripts for this section are also in the file combining_diacritical_marks.sql in the archive collationscripts.zip.

Let's start with creating a table and adding some data (you may already have the database from previous scripts):

CREATE DATABASE Latin1_General_100_CI_AS COLLATE Latin1_General_100_CI_AS
go
USE Latin1_General_100_CI_AS
go
DROP TABLE IF EXISTS résumé
CREATE TABLE résumé (id     int          NOT NULL PRIMARY KEY,
                     résumé nvarchar(20) NOT NULL)
INSERT résumé (id, résumé)
  VALUES(1, N'résumé'), 
        (2, N'résumé')

I expect this to render correctly in most browsers, but just in case, this is how the script looks in SSMS:

The last batch of the query above. All instances of résumé look the same.

We run a simple query:

SELECT résumé, COUNT(*) AS cnt
FROM   résumé
GROUP  BY résumé

The output is as expected:

The result set of the query above. There is a single row with cnt = 2.

But the next query brings a surprise:

SELECT id, len(résumé) AS charlen, 
       substring(résumé, 1, 3) AS FirstThree, substring(résumé, 1, 2) AS FirstTwo
FROM   résumé

Output from the query above. For the first row, charlen is 6 and FirstThree is rés, but for the second row charlen is 8 and FirstThree is only ré.

What looked like two identical strings turns out not to be. The first string has six characters, while the second has eight. When we use substring to extract the first two and three characters, it seems that FirstThree is missing one character for id = 2. With FirstTwo, you do get two characters – but where did the accent go? As you may have guessed by now, for id = 1, résumé is constructed with the precomposed é, whereas for id = 2, the accented e is achieved with an unaccented e and a combining acute accent.
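
Another way to see the difference, without casting to binary as we do next, is to inspect the code points of the individual positions with unicode and substring. This is just an alternative sketch:

SELECT id,
       unicode(substring(résumé, 2, 1)) AS second_char,   -- 233 (precomposed é) or 101 (plain e)
       unicode(substring(résumé, 3, 1)) AS third_char     -- 115 (s) or 769 (combining acute accent)
FROM   résumé
ORDER  BY id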

We can see this more clearly, if we cast to binary:

SELECT id, résumé, cast(résumé AS varbinary(40))
FROM   résumé

In the binary column, we have these two values:

0x7200E900730075006D00E900
0x720065000103730075006D0065000103

Let's introduce some spaces and also swap the bytes to correct for the little-endianness of the Wintel architecture:

0x0072 00E9 0073 0075 006D 00E9      
0x0072 0065 0301 0073 0075 006D 0065 0301

If we look at the first string, we can see that there are four characters in the ASCII range, and one in the Latin‑1 range which appears twice. The latter is obviously the precomposed é. In the second string, the second character is 0x65, which you may recognise as the normal lowercase ASCII e. (Lowercase letters start on 0x61, and e is the fifth character.) It is followed by 0x0301 which is unlikely to ring a bell with very many readers, but that is the combining acute accent.

You could argue that SQL Server is not handling the combining diacritics in the best way. It does understand that the precomposed and the combining versions are equal, even though we are in an accent-sensitive database. But should len really return 8 for the combining version? And should substring behave the way it does? Or, in other words, should the combining diacritical marks count as characters on their own, or are they part of a single four-byte sequence together with their base character? But you can also argue that this is not the same as UTF‑8 sequences or surrogate pairs, which are purely encoding devices. The diacritical marks are code points in their own right. Out of curiosity, I tried the same thing on Postgres and MySQL, and their substring produced the same result as SQL Server's. So you could say that there is a standard there. Nevertheless, combining diacritics can certainly confuse you if you are not aware of them.

You may ask, how common are combining diacritics in practice? I have no data, but I would guess that precomposed is dominating, at least for European languages. However, I have encountered combining accents in the wild. I've helped an occasional forum poster who had such data, and who did not understand what was going on. I also recall that a couple of years back I was getting promotional mail from a bicycle store here in Stockholm. I happened to read these mails through a Telnet program. I first thought the Swedish characters were just garbled, but eventually I realised that it was a matter of combining diacritics that were not rendered very well. My gut feeling is that most applications render them correctly, and therefore you will not be able to tell immediately that you have combining diacritics in your data. It is not until SQL Server starts to produce some unexpected results that you will know!

Ironically, though, I can offer one example of an application that does not handle combining diacritics correctly: the HTML editor I'm using to write my articles. This is how the initial INSERT statement looks in the editor:

Picture of how the INSERT statement looks in my HTML editor.

It seems that someone has misunderstood how the combining diacritical marks should work, because the accents are placed on top of the character following the combining mark and not the one preceding it. I should hasten to add that I'm running a fairly old version of this product, so the bug may have been fixed. (Another interesting behaviour of the editor is that when I saved the HTML file, closed the editor and re-started it, the e + combining accent had been replaced by a pre-composed é. Thus, for the HTML, I had to use an escape sequence for the combining accent.)

We will play a little more with our combining acute accent. We will first try to replace both é and e:

SELECT replace(résumé, 'é', 'x') AS replace_eacute, 
       replace(résumé, 'e', 'x') AS replace_plain_e,
       replace(résumé COLLATE Polish_CI_AI, 'e', 'x') AS replace_AI
FROM   résumé

The result is the same for both rows:

Output of the query above.

We can only replace the plain e, if we employ an accent-insensitive collation.

What if we employ a binary collation? In a binary collation, there is all reason to believe that each character is handled as its own code point.

SELECT id, replace(résumé COLLATE Latin1_General_BIN2, 'e', 'x')
FROM   résumé

And indeed, look at this:

Output of the query above. For id = 2, it is now rxsumx with accents on the x:s!

We were able to replace the e, but the accent is still there. And here you can kind of see the point with these combining marks. Say that you are a linguist, and you are working with a language that is hitherto unwritten. You find that there is this x-like sound, and you think the perfect character for the sound would be an x with an acute accent. Unicode has it for you!

Let's go even wilder and replace the e with a digit:

UPDATE résumé
SET    résumé = replace(résumé COLLATE Latin1_General_BIN2, 'e', '8')
FROM   résumé

SELECT id, résumé, charindex('8', résumé)
FROM   résumé

This time, things break down a little bit:

Output of the query above. The accents are now in the wrong place. But charindex returns 0 for both.

The accents now appear in the wrong place, like in the example from my HTML editor above. But don't be fooled. This is just a rendering detail. Logically, there is still an accented 8 in the string, and the result of 0 from charindex tells us that there is no unaccented 8 in that string.

Likewise, this query returns no rows:

SELECT * FROM résumé WHERE résumé = 'r8sum8'

But cast to an accent-insensitive collation and it does:

SELECT * FROM résumé WHERE résumé = 'r8sum8' COLLATE Czech_CI_AI
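
And if we go back to charindex, but this time search with a binary collation, which compares code point by code point, we do find the digit in the second row. Here is a small sketch:

SELECT id, charindex(N'8', résumé COLLATE Latin1_General_BIN2) AS pos
FROM   résumé

This should return 0 for id = 1 (that row never had any plain e to replace, so it has no 8 at all) and 2 for id = 2, where the unaccented 8 is found even though it is followed by the combining accent.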

Variation Selectors

Variation selectors are a little bit on the esoteric side (but not as esoteric as the compatibility collations!), and readers who are not working with Japanese and feel that they can skip topics for which they will never have practical use are free to do so. A second reason to skip this section might be that this is an area which is beyond my realm of knowledge. To be able to write this section, I had to read Wikipedia and google around quite a bit. There can certainly be inaccuracies and errors in this section due to my misunderstandings.

Unicode defines a number of so-called variation selectors. These characters are similar to combining diacritical marks in that they modify the preceding character. But while the combining diacritics all are characters that have a graphical representation if printed alone, this is not true for the variation selectors. A variation selector on its own carries no meaning, but displays as an empty rectangle, the regular font-fallback character.

Unicode defines a number of sequences with variation selectors as listed on the page StandardizedVariants-15.1.0.txt. But as I understand it, if you are creating your own font, you could define your own sequences to provide for glyph variants.

We need a database and a table to work with:

CREATE DATABASE Czech_100_CS_AS_SC COLLATE Czech_100_CS_AS_SC
go
USE Czech_100_CS_AS_SC
go
CREATE TABLE VSelectors (id       int          NOT NULL,
                         GrpName  nvarchar(30) NOT NULL,
                         seq      nvarchar(10) NOT NULL,
                         CONSTRAINT pk_VSelectors PRIMARY KEY (id)
)

The database has a Czech collation, but that is only to keep the collation name short; we will not work with Czech in this section. What is important, though, is that the collation is surrogate compatible, because we will stray outside the Unicode base plane.

While variation-selector sensitivity is only available with Japanese collations, we will start elsewhere and our first example is from Mongolian. Mongolian can be written in Cyrillic script and this is possibly more common today, but we will work with a character from the traditional Mongolian script, more precisely the character MONGOLIAN LETTER A, and the two variants listed on StandardizedVariants-15.1.0.txt. (My understanding of the rationale for the Mongolian variation selectors is that many letters in the Mongolian script take different shapes depending on where in the word the letter appears. This may be related to Mongolian being written in vertical lines.)

In this section, the scripts are all ASCII, and all characters and variation selectors are given with their Unicode code points in hex.

INSERT VSelectors(id, GrpName, seq)
   VALUES(1, 'Mongolian', nchar(0x1820)),
         (2, 'Mongolian', nchar(0x1820) + nchar(0x180B)),
         (3, 'Mongolian', nchar(0x1820) + nchar(0x180C))

To study what we have, we run this query:

SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
ORDER  BY id

Besides looking at the glyphs as such, we also inspect how many characters SQL Server thinks there are in the string, and we use the rank function to see if SQL Server thinks the strings are different from each other or equal. This is the output:

Output from the previous query

When we look at the glyphs, we can see that the second character is distinctly different from the first. However, the third looks very similar to the first, and indeed they seem to be identical. I've copied and pasted into Word, increased the font size, and even installed a Mongolian font, but, no, they remained identical. As we shall see, this is a recurring theme with variation selectors.

Looking at the column charlen, we can see that SQL Server counts a variation selector as a character of its own, just like it does with a combining diacritical mark. The column rank tells us that SQL Server considers all three sequences to be equal. The collation is accent-sensitive, but the variation selectors are not diacritical marks that may alter pronunciation or stress. They only serve to produce a graphically different version of the preceding character, so therefore the strings are considered to be equal.

Because Mongolian has so many glyph variants, Mongolian has its own variation selectors in the Mongolian block. But Unicode also has a block of general variation selectors from U+FE00 to U+FE0F, usually referred to as VS1, VS2 etc up to VS16. Far from all of them are used for any officially defined variation sequences. Here is an example with the Burmese character KA, taken from StandardizedVariants-15.1.0.txt:

INSERT VSelectors(id, GrpName, seq)
  VALUES(11, 'Burmese', nchar(0x1000)),
        (12, 'Burmese', nchar(0x1000) + nchar(0xFE00))

We run the SELECT query above to look at the data. If you run it in a normal SSMS window, you will have severe difficulties sensing any difference between the Burmese glyphs, but here is the output from an SSMS window where I have the font size set to 26. (It's on a VM that I use for presenting my sessions, where I need a big font size so that the audience can read.)

Output from the same SELECT query, now with Burmese characters added

If you look closely, you can see that the second glyph has a small dot to the left. What this dot signifies, I don't know, but it is the result of the variation selector. We can also see that the results for charlen and rank are the same as for the Mongolian characters. (As for the rank, recall that the OVER clause has PARTITION BY GrpName, so it is not that the Mongolian and Burmese characters are considered equal.)

While many sequences with variation selectors relate to East-Asian languages, there are also quite a few related to mathematical symbols. On the top of StandardizedVariants-15.1.0.txt we find:

0030 FE00; short diagonal stroke form; # DIGIT ZERO

So let's try that:

INSERT VSelectors(id, GrpName, seq)
  VALUES(21, 'Zero', N'0'),
        (22, 'Zero', N'0' + nchar(0xFE00))

But the result of our SELECT query is disappointing:

Display of DIGIT ZERO without and with variation selector - there is no visible difference

In this particular font, Consolas, the zero always comes with that diagonal stroke. But I have also copied into Word and tried different fonts. There is never any difference. Either both zeroes have a stroke, or neither of them has. And the same is true for the other mathematical characters I have tried; the glyphs are always the same, and the variation selector has no visual effect. I guess the answer is "not implemented yet".

Another group of characters where variation selectors come into play are emojis. For these, VS15 is used to denote monochrome/text version and VS16 specifies that the coloured form should be used. Here is an example with two different emojis:

INSERT VSelectors(id, GrpName, seq)
  VALUES(31, 'Emoji', nchar(0x2764) + 
                      nchar(0x01F612)),
        (32, 'Emoji', nchar(0x2764) + nchar(0xFE0E) + 
                      nchar(0x01F612) + nchar(0xFE0E)),
        (33, 'Emoji', nchar(0x2764) + nchar(0xFE0F) + 
                      nchar(0x01F612) + nchar(0xFE0F))

SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
WHERE  id BETWEEN 31 AND 33
ORDER BY id

The result set again seems to be disappointing:

Result of the queries. All emojis appear to look the same.

However, SSMS is a text editor, and not really aimed for emojis. Copy the seq column into Word and you see this:

The emojis pasted into Word. Now they are different.

When there is no variation selector we get the default form, which sometimes is the black-and-white version, but more often, it seems, the coloured version.

After all these examples, let's now approach the variation-selector-sensitive (VSS) collations we skipped earlier in the article. As I mentioned, they were introduced in SQL 2017 with a number of Japanese version-140 collations. Let's first change the collation of our VSelectors table to a variation-selector-insensitive 140-collation for an interesting observation:

ALTER TABLE VSelectors 
   ALTER COLUMN seq nvarchar(10) COLLATE Japanese_XJIS_140_CI_AS NOT NULL
SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
ORDER BY id

This is the output:

All characters with variation sequences with a 140-collation. The charlen column is the interesting thing.

The values in the charlen column have changed! In these collations, the variation selectors do not count as separate characters, but they are part of the character they modify. This change does not seem to apply to combining diacritical marks, but they count as characters of their own in 140-collations as well.
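
You can see the same thing in a standalone expression, without the table. If the observation above is correct, this should return 2 under the version-100 collation but 1 under the version-140 collation (illustrated here with the Mongolian letter A and its first variation selector):

SELECT len((nchar(0x1820) + nchar(0x180B)) COLLATE Japanese_XJIS_100_CI_AS) AS len_100,
       len((nchar(0x1820) + nchar(0x180B)) COLLATE Japanese_XJIS_140_CI_AS) AS len_140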

Let's now try a VSS collation, and we expect changes in the rank column, since the variation selectors are supposed to make a difference.

ALTER TABLE VSelectors 
   ALTER COLUMN seq nvarchar(10) COLLATE Japanese_XJIS_140_CI_AS_VSS NOT NULL
SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
ORDER BY id

But, no, the output is the same as above! At this point, I scratched my head, but then I said to myself: after all, the VSS collations are all Japanese. To find a Japanese example, I googled, and I got a hit on a thread on StackExchange where a user was having a problem with a variation selector not working as intended, and she had an example with a kanji character and the selector U+E0100, that is, a character in one of the supplementary planes. Through Wikipedia, I learnt that there is a Variation Selectors Supplement for selectors VS17-VS256. And apparently, these are the selectors used for Japanese.

I wanted to find more examples, and went to Google again. This time I found a short academic-style paper, Recent Trends in Standardization of Japanese Character Codes by Taichi Kawabata, which gives some background and details. The paper includes one example character, for which the author explains the difference between two variants. He also gives a few more variants, although the pictures are a little blurred, so the differences are not entirely apparent. (Interestingly enough, the example character he uses is exactly the one I found in the StackExchange thread.)

Let's insert some of the examples from the paper by Kawabata-san, and also the selector in the StackExchange post. To verify that these selectors work like the other ones, we flip back to a non-VSS collation:

ALTER TABLE VSelectors 
   ALTER COLUMN seq nvarchar(10) COLLATE Japanese_XJIS_140_CI_AS NOT NULL
INSERT VSelectors(id, GrpName, seq)
   VALUES(41, 'Japanese',    nchar(0x845B)),
         (42, 'Japanese',    nchar(0x845B) + nchar(0x0E0100)),
         (43, 'Japanese',    nchar(0x845B) + nchar(0x0E0102)),
         (44, 'Japanese',    nchar(0x845B) + nchar(0x0E0103)),
         (45, 'Japanese',    nchar(0x845B) + nchar(0x0E0106)),
         (46, 'Japanese',    nchar(0x845B) + nchar(0x0E0107)),
         (47, 'Japanese',    nchar(0x845B) + nchar(0x0E0109))
SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors
WHERE  id BETWEEN 41 AND 47
ORDER BY id

The behaviour is the same as for the previous groups. This is the output:

The output of the query with the Japanese character

As for the glyphs, I don't think there is any difference with this font (Consolas). But I copied the characters into Word (and removed the line breaks), and selected the font Yu Gothic Medium, which is said to be aimed at Japanese.

The Japanese characters in the font Yu Gothic Medium.

The second glyph, the one with selector U+E0100, is different. (Look at the lower part to see the difference.) However, the other glyphs all seem to be the same, which is reminiscent of the initial Mongolian example.

Let's now once more switch to a VSS collation and see if the rank column changes this time:

ALTER TABLE VSelectors 
   ALTER COLUMN seq nvarchar(10) COLLATE Japanese_XJIS_140_CI_AS_VSS NOT NULL
SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
WHERE  id BETWEEN 41 AND 47
ORDER BY id

Indeed, it does:

The Japanese characters now with a VSS collation.

I have to say that I am a little puzzled over why Microsoft decided to consider only the selectors in the Variation Selectors Supplement for variation-selector sensitivity, and not the regular variation selectors or the Mongolian selectors. But maybe the explanation is that in Japanese the selectors can make a difference between different words, while this is not the case in other languages.

End of the story? It could have been, but not really. I decided to try one of the new version-160 collations:

ALTER TABLE VSelectors 
   ALTER COLUMN seq nvarchar(10) 
      COLLATE Chinese_Simplified_Pinyin_160_CI_AS_SC_UTF8 NOT NULL
SELECT id, GrpName, seq, len(seq) AS charlen, 
       rank() OVER (PARTITION BY GrpName ORDER BY seq) AS rank
FROM   VSelectors 
WHERE  id BETWEEN 41 AND 47
ORDER BY id

The result was surprising:

The Japanese characters, now according to a version-160 Chinese collation

charlen is again 2, and the collation is variation-selector sensitive, despite VSS not being in the name? I scratched my head, and thought, yeah, maybe this matters in Chinese as well. Eventually, I changed 160 to 100 in the collation name. Same result. Flipped back to the database collation. Same result. The conclusion has to be that in the version-100 collations as well as the version-160 collations, the selectors in the Variation Selectors Supplement are not understood as variation selectors, but as something else.

It seems that when Microsoft implemented the version-160 collations, they did not use the version-140 collations as the starting point. (We have already observed this when it comes to the names.) Rather, it seems that they took the version-100 Chinese collations and added what was required for the Chinese market.

Collations Considerations for ISVs

As we have noted, the system and database collations apply not only to the data, but also to the metadata in the system catalogue. If you are writing software for an in-house application where there will only be one production instance, this is less of an issue. Presumably, your development environment and your production environment have the same collation.

On the other hand, if you are working for an ISV with a system that will be installed at many customer sites, you don't have control of what collation there will be in different production environments. If you are oblivious to collations, you can run into some nasty surprises, particularly if you aim at an international market. Your development environment has a case-insensitive collation, and you and your colleagues sometimes spell the table Invoices just like that, sometimes you write invoices, and some of you may even prefer INVOICES. It all goes well, until you get your first Turkish customer. (Recall that in Turkish there are both a dotted and a dotless I, so i and I are never equal.) Now you have a mess to clear up.
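
Here is a minimal illustration of the Turkish-I problem, reduced to a single string comparison (the table name itself is not needed to make the point):

-- Under a Turkish collation, the case-insensitive comparison still fails,
-- because lowercase i and uppercase I are different letters in Turkish.
SELECT CASE WHEN 'invoices' COLLATE Turkish_CI_AS = 'Invoices'
            THEN 'equal' ELSE 'not equal'
       END AS comparison        -- returns 'not equal'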

In this chapter we will look at different strategies you can apply to prevent such gaffes. This includes a few options around collations we have not looked at this far.

Azure SQL Database

It could be that you decide that your system will only run on Azure SQL Database. (Presumably, this would be because you are providing a SaaS, Software-as-a-Service, solution.) In this case, you don't have to worry about the catalogue collation, since in Azure SQL Database the catalogue collation always defaults to SQL_Latin1_General_CP1_CI_AS, no matter which database collation you select.

It is possible to override this when you create the database through T‑SQL. In this case, you can say things like:

CREATE DATABASE MyDB COLLATE Finnish_Swedish_100_CS_AS_SC_UTF8 
(EDITION = 'YourChoice', SERVICE_OBJECTIVE = 'YourChoice')
WITH CATALOG_COLLATION = DATABASE_DEFAULT

This will result in a database where the catalogue collation is the same as the database collation, just like in regular SQL Server. This option is not exposed in the Portal, but only in T‑SQL and scripting options like Bicep and ARM Templates.

You may have to let the catalogue collation be the same as the database collation, if you are moving a legacy application to Azure SQL Database. But if you are working with a new application, there is very little reason not to stick with the default of SQL_Latin1_General_CP1_CI_AS. Yes, as we saw previously, this is a very old collation with some funny quirks. But these funny quirks are with varchar, and the columns in the system catalogue are invariably in nvarchar. And for nvarchar, this collation is equivalent to Latin1_General_CI_AS. The only time this collation cannot serve you is if you want to use object or column names from scripts not supported by version-80 collations, like Telugu or Pinyin. Or if you want table names with emojis in them. :-)

It is worth noting that you cannot change the catalogue collation, once you have created the database. Or, at least, I have not found any ALTER DATABASE syntax to do this.

For the rest of the chapter, we will look at options when your database will exist in SQL Server on-prem, in an Azure VM, on Azure SQL Managed Instance, or in a non-Microsoft cloud. That is, everything that is not Azure SQL Database.

Mandate the Collation

One possible strategy is to say that your application requires the collation Latin1_General_100_CI_AS (or whichever you pick) and nothing else is supported. That is, you require customers to use the same collation that you use in your development environment.

I can't say that this is a strategy that I like. It certainly has the potential to make you come across as a very arrogant vendor. Then again, you know your system, and I don't. Maybe your application does not work a lot with string data, so there would not be very many places where the users experience data that is not correctly sorted according to their alphabet. Or maybe the user base for an installation is typically international, so there is never one single correct way to sort the data anyway. It may also be that you are sorting all data client-side according to the user's regional settings. That may still leave you with some rough edges when it comes to searching, but, as I said, you know your system, and I don't, so you are in a better position to tell whether you can get away with this strategy from the perspective of user experience.

But it is not only a matter of the relation to the users. If you want to mandate the collation to avoid problems with identifiers, you need to mandate both the database and the server collation. And the latter may put you at odds with the local DBA team. Because if the collation you mandate is one which is not normally used in that organisation, this means that they need to set up a separate SQL Server instance only for your application. Again, you know your system, and I don't. Maybe your product is a large enterprise-class application, and you have all reason to expect that customers will install it on a dedicated server, so mandating the server collation would not be a big deal. But if you have a smaller department-type of application which is likely to run on a shared instance, I don't think it is a good idea to require a specific server collation.

I may be pointing out the obvious, but if you are going to require a certain collation, and you are aiming at an international market, you are likely to get much better acceptance for a Latin1_General collation (including SQL_Latin1_General_CP1_CI_AS) than one tied to your local language.

Before I move on to the next alternative, I would like to mention that there may be cases where you want to mandate a certain type of collation. If you want to explicitly support emojis, you may want to call out that the application needs a surrogate-compatible collation. Or if you claim to support Unicode, but have decided to use varchar, then you obviously have to require a UTF-8 collation. In these two examples, your concern is about data and not about metadata, so there is no reason to mandate the server collation, only properties of the database collation, which should be less of a concern to the local DBAs, since the database can still cohabit with other databases on the same instance, if this is desired.

Although there is a very important precondition here: you consistently make use of COLLATE DATABASE_DEFAULT in your temp-table definitions.
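
The pattern looks like this (the table and column names here are hypothetical):

CREATE TABLE #orders
       (orderid  int                                     NOT NULL PRIMARY KEY,
        custname nvarchar(100) COLLATE DATABASE_DEFAULT  NOT NULL)   -- follows the database, not the server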

Develop Case-sensitive, Test Case-insensitive

This alternative is my preference, and it is also one I worked according to for many years when I was working for an ISV.

Note: the reason we made this choice was not based on international concerns; originally our product only targeted the Swedish market. But as a small company we felt that we could not mandate the sort order for the customer. So if the customer said "our SQL Server is case-sensitive", we would have no say. (This choice was made in the days of SQL Server 4.x, when the sort order was one and the same for the entire instance.) If you have written the code in a case-insensitive environment, the work to change the code to be consistent in case is a major undertaking. And an utterly boring one.

The idea is simple: in your development environments, you have a case-sensitive collation of your choice. This way, it will be you yourself who discover that you have written invoices for a table called Invoices and not your Turkish customer. And if you have a column called cString, you will catch it yourself when you write it cstring and not a customer in Hungary.

While I am at it: I also recommend that you have one case-sensitive collation on server level, and another on database level. This way, you will quickly notice when you forget to use COLLATE DATABASE_DEFAULT with your string columns in temp tables.
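
To illustrate what such a setup catches, here is a sketch with hypothetical names. The temp-table column silently gets the server collation, so the join dies as soon as the server and database collations differ:

CREATE TABLE #names (custname nvarchar(100) NOT NULL)   -- no COLLATE clause: server collation
-- Customers is a hypothetical permanent table; its column has the database collation.
SELECT c.custid
FROM   Customers c
JOIN   #names n ON n.custname = c.custname   -- fails with Msg 468 when the collations differ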

If you apply this strategy, you are likely to go mad after a while as you are trying to recall if you called that table ProductGroups, productGroups, Productgroups or something else. There is a simple remedy for this. In the place where I worked, all our names were lowercase only. Yes, initially it looks awkward to look at a name like instrumentclearingmarketplaces rather than InstrumentClearingMarketPlaces. But you get used to it quite quickly. (And you can use underscores to separate parts, if you like. We did not do that.)

If you use a case-sensitive collation, there is a small risk that you end up with two identifiers differing only in case, for instance mytable and MyTable. Or it could just be table aliases like this:

SELECT ...
FROM   Apricots A
JOIN   apples a ON A.col = a.col

Since the vast majority of your customers will prefer a case-insensitive collation (since that gives a better user experience when searching), you need a way to catch this internally. For this reason, I recommend that your test environments have a case-insensitive collation, so that you catch such gaffes before they make it to the customers.

What could possibly be a challenge here is that your developers mainly work with their LocalDB environments rather than a common development server. On LocalDB, the server collation is always SQL_Latin1_General_CP1_CI_AS, and not all your developers may understand that they have to set a different collation for the database. In this situation, it may be better to let one of your test environments be case-sensitive. The important thing is that you need to have both case-sensitive and case-insensitive environments internally.

Even if you embrace this strategy, you can still run into accent-insensitive surprises, at least with Finnish_Swedish AI collations where v and w are considered to be equal. I have a fond memory of a forum post quite a few years ago where the poster (with an alias that suggested a non-Nordic origin) had a piece of code with both the variables @v and @w, and he or she was confounded over the error message

Msg 134, Level 15, State 1, Line 49

The variable name '@w' has already been declared. Variable names must be unique within a query batch or stored procedure.

Presumably, the poster worked for an ISV whose software had been installed on an instance with Finnish_Swedish_CI_AI as the server collation for the first time.

In this particular case, the solution is easy to spot: Using one-letter variable names is not good for readability and is bad practice anyway. And the longer your names are, the smaller the risk for a clash.

Then again, what about this query and the error that goes with it:

CREATE DATABASE Finnish_Swedish_CI_AI COLLATE Finnish_Swedish_CI_AI
go
USE Finnish_Swedish_CI_AI
go
CREATE TABLE vendors (vendorid  int NOT NULL PRIMARY KEY,
                      name      nvarchar(50) NOT NULL)
CREATE TABLE watermarks (guid       uniqueidentifier NOT NULL PRIMARY KEY,
                         vendorid   int              NOT NULL,
                         issueddate date             NOT NULL)
SELECT w.guid, w.issueddate, v.name
FROM   watermarks w
JOIN   vendors v ON w.vendorid = v.vendorid

The error message is:

Msg 1011, Level 16, State 1, Line 97

The correlation name 'v' is specified multiple times in a FROM clause.

I am certainly guilty of using one-character aliases myself every now and then. And more specifically, I prefer short aliases to make my queries concise.

It is difficult to give general advice here. Then again, compared with the total mess you get when code developed case-insensitive meets a Turkish collation, this is a smaller problem. You can probably relatively easily sort it out, the day it happens. But if the Finnish and Swedish markets are important to you, you have reason to consider having a special test environment set up with an accent-insensitive Finnish_Swedish collation.

I am not aware of any other accent-insensitive surprises you can run into. At least not if you use English for your identifiers. If you have identifiers in your local language and there are characters beyond ASCII, another language's accent-insensitive collation can certainly bite you. (My personal preference is to always write code in English, since with today's global world, you never know who is going to work with your code in the future.)

In the next two sections, I will discuss two more alternatives for the sake of completeness, even if I am not inclined to recommend any of them.

Partially Contained Databases

Partially contained databases are a feature that was introduced in SQL 2012, but I don't think they have gotten much traction. The idea with a contained database is that it should be possible to move it from one SQL Server instance to another without having to bother about everything else on the server that it is dependent on: configuration options, logins, jobs etc. What shipped in SQL 2012 lives up to some of these ambitions, but certainly not all, hence the "partial" moniker.

Note: to be able to create a contained database, the configuration parameter contained database authentication must be set to 1.
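
That is, before you run the demo below, you may need to run:

EXEC sp_configure 'contained database authentication', 1
RECONFIGURE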

The main allure for contained databases is that they permit authentication at the database level. Some sites use this to evade the need to sync logins between nodes in availability groups. (A better solution to this problem may be contained availability groups introduced in SQL 2022, where the AG has its own master and msdb that fail over with the rest of the AG.)

However, there is a property of contained databases that is interesting from a collation perspective. To wit, in a partially contained database, the catalogue collation is always one and the same, Latin1_General_100_CI_AS_KS_WS_SC. Moreover, this collation applies also to temp-table names and variable names. That is, they do not follow the system collation as they normally do. To demonstrate, run this script on an instance with a case-sensitive server collation:

CREATE DATABASE default_collation
go
USE default_collation
go
CREATE TABLE MyTable(a int NOT NULL)
go
SELECT a FROM mytable
go
SELECT A FROM MyTable
go
CREATE TABLE #temp (a int NOT NULL)
CREATE TABLE #TEMP (A int NOT NULL)
go
DECLARE @a int
DECLARE @A int
go
DECLARE @T TABLE(a int NOT NULL,
                 A int NOT NULL)
go
PRINT 'Switching to contained database'
CREATE DATABASE contained CONTAINMENT = PARTIAL
go
USE contained
go
CREATE TABLE MyTable(a int NOT NULL)
go
SELECT a FROM mytable
go
SELECT A FROM MyTable
go
CREATE TABLE #temp2 (a int NOT NULL)
go
CREATE TABLE #TEMP2 (A int NOT NULL)
go
DECLARE @a int
DECLARE @A int
go
DECLARE @T TABLE(a int NOT NULL,
                 A int NOT NULL)
go
USE tempdb
go
DROP DATABASE contained
DROP DATABASE default_collation
DROP TABLE #temp
DROP TABLE #TEMP
DROP TABLE #temp2

If you have a case-insensitive server collation, it is unlikely that you will see any difference, so I will show you the output I get on my instance (which has a case-sensitive server collation):

Msg 208, Level 16, State 1, Line 7

Invalid object name 'mytable'.

Msg 207, Level 16, State 1, Line 9

Invalid column name 'A'.

Switching to contained database

a
-----------

(0 rows affected)

A
-----------

(0 rows affected)

Msg 2714, Level 16, State 1, Line 29

There is already an object named '#TEMP2' in the database.

Msg 134, Level 15, State 1, Line 32

The variable name '@A' has already been declared. Variable names must be unique within a query batch or stored procedure.

Msg 2705, Level 16, State 3, Line 37

Column names in each table must be unique. Column name 'A' in table '@T' is specified more than once.

That is, with the first database, I get errors when I refer to MyTable or its columns with a different case than in the definition, while this does not happen in the contained database. On the other hand, in the default_collation database, I can create both #temp and #TEMP and declare both @A and @a. I can also have a table variable with both the columns A and a. All these three operations fail in the contained database.

Thus, if you are aiming at an international market, you could rid yourself of all collation woes by requiring the customer to create the database as partially contained. You don't even have to think about applying COLLATE DATABASE_DEFAULT to the string columns in your temp tables, because with a contained database, the collation for string columns in temp tables defaults to the database collation and not the server collation.

Still, while this is great and fantastic, I find it difficult to recommend contained databases. The reason for this is that there are some features that are not supported in contained databases. The most important are Replication, Change Data Capture and Change Tracking. You may not be using any of them today, but you may find use for them tomorrow. Or your customers may want to use them to extract data into their own data warehouse etc. It seems unwise to me to lock yourself out from these fairly popular features.

The reason we have to live with this metadata-collation mess is the legacy from the days when there was only one sort order on the machine. Microsoft can't change this on a general basis, because that would break many applications. They were able to make this change with partially contained databases, since it was a new feature, so backwards compatibility was not a concern. Now, what I would like to see is an option that gives me only the collation pieces of contained databases, and therefore I have submitted a feedback item on the Azure Ideas site for this. Votes are welcome!

To learn more about contained databases, you can read https://learn.microsoft.com/en-us/sql/relational-databases/databases/contained-databases.

The CATALOG_COLLATION Option

We have already looked at the option to specify the catalogue collation in Azure SQL Database. Starting with SQL 2019, this option is also available in SQL Server itself, so that you can say:

CREATE DATABASE FixedMetadata WITH CATALOG_COLLATION = SQL_Latin1_General_CP1_CI_AS

Note that the only other choice for CATALOG_COLLATION is DATABASE_DEFAULT; you cannot pick any other collation. At first glance, this seems to be the solution that I called for at the end of the previous section, but it is even more difficult to recommend this option. There are three reasons for this:

Various Minor Topics

In this chapter we will explore a number of topics around collations with no particular relation to each other, but which do not warrant full chapters of their own. We will first look more closely at how SQL Server Setup chooses the default for the server collation. Next, we will study three collation-related settings for linked servers. We will also learn how we can use collations to remove all non-ASCII letters. Lastly, we will look at how confusing it can be if you run a script that spans multiple databases with different collations.

Selecting the Server Collation at Setup

Early in the article, I stressed the necessity to check what server collation Setup suggests as the default, so that you don't find yourself with a poor choice of collation later on. As I have mentioned, changing the database and/or server collation after the fact can be very painful. (For more details on how to do this, see the separate article Changing the Server and Database Collation in SQL Server.)

In this section we will first take a closer look at the Setup screens for collations, before we move to a survey of how SQL Server Setup selects the default collation based on the Windows system locale.

As you page through the Setup Wizard, you will encounter the Server Configuration page, which looks like this:

The Server Configuration page in the Setup Wizard. The Collation tab is highlighted.

Very unfortunately, in my opinion, Collation is on a secondary tab – I think it should be on a page of its own, so that the default collation is exposed to you. If you don't watch out, you will get a default collation that may not be appropriate for your users and applications. Therefore, you should always take a look at what is on the Collation tab, and be prepared to change the default.

When I run the Setup Wizard on my machine, I see this:

The Collation tab. It lists two collation choices, one for Database Engine and one for Analysis Services.

If you did not opt to install Analysis Services (SSAS), you will not see that part of the screen. (I will not discuss Analysis Services further. The Customize screen for SSAS is similar to the one for the engine, but it has fewer choices, as SSAS supports fewer collations than the Database Engine.)

If you happen to agree with the defaults, you can press Next> to move on. But you can also click Customize. This will take you to a screen with collation options, with the default collation filled in. Here is what I see on my machine:

The collation configuration page. There is one section for Windows collations where you can combine the various collation properties. In the lower part there is a list of SQL collations.

To the left, there is a main radio button to make the choice between Windows and SQL collations. If you select SQL collation, you will have to select the exact collation from the list box below that radio button. If you select Windows collation, it is far more sophisticated. First you select what they call a collation designator. Translated to the terminology I have used in this article, the collation designator includes both the collation family and the version number. For instance, just below Finnish_Swedish in the dropdown, you find Finnish_Swedish_100. Then you can use the check boxes to select the properties you want. Binary means _BIN and Binary-code point means _BIN2; the others should be self-explanatory.

As you can see in the picture, some options are greyed out. That is because they are not available with the currently selected collation designator. The same goes for the box Char/Varchar Storage Options to the right. For a version-80 designator, the option UTF‑8 is unavailable. On the other hand, if you select Indic_General_90 or any other family that does not have a legacy code page, you will find that the radio button is locked on UTF‑8.

Or, in short, this dialog knows which combinations are possible and which are not, and you can't select an illegal collation.

If you are running Setup unattended, you can use the /SQLCOLLATION switch to set the collation, and you should always do this. Just like when you run the wizard, you should never rely on the default collation. You may think that you are comfortable with that default collation; however, the system locale may not be the one you think. You may have configured your user to use, say, Czech regional settings. But the system locale could still be English (United States) because you installed an English OS, and you never changed the locale.

Note: The place to do this is Control Panel->Region->Administrative->Language for non-Unicode programs:

Tab in Control Panel->Region where you can change the system locale.

One can't say that Language for non-Unicode programs is crystal clear, but the button to change the locale is. Note that a change of the system locale requires a system restart to take effect.

If you want to know which collation you can expect given your system locale, you can find a table in the section Server-level collations in a long topic on collations in the SQL Server docs. This table gives you the default collation for a given Windows locale. There are many entries in the table, and to get a better understanding of it, I copied the contents of the table to Excel, saved the result as a CSV file and loaded it into an SQL Server table with the help of the BULK INSERT statement.

DROP TABLE IF EXISTS DefaultCollations
CREATE TABLE DefaultCollations 
            (WindowsLocale sysname NOT NULL,
             WLCID         varchar(10) NOT NULL,
             SQLLCID       varchar(10) NOT NULL,
             Collation     sysname NOT NULL,
             PRIMARY KEY (WindowsLocale, WLCID)
)
go
BULK INSERT DefaultCollations FROM 'C:\Temp\Book1.csv'
WITH (FIELDTERMINATOR =';', FIRSTROW=2, FORMAT= 'CSV', CODEPAGE =65001)

Note: I'm using semicolon as the field terminator here, since with Swedish regional settings, the semicolon serves as the field delimiter in CSV files. If you want to load the data yourself and your regional settings are different, you need to change the terminator.

To analyse the data, I ran this query:

; WITH CTE AS (
   SELECT CASE WHEN Collation LIKE 'SQL%'              THEN Collation 
               WHEN Collation = 'Latin1_General_CI_AS' THEN Collation
               WHEN Collation LIKE '%[_]140[_]CI[_]AS' THEN 'Windows_140_CI_AS'
               WHEN Collation LIKE '%[_]100[_]CI[_]AS' THEN 'Windows_100_CI_AS'
               WHEN Collation LIKE '%[_]90[_]CI[_]AS'  THEN 'Windows_90_CI_AS'
               WHEN Collation LIKE '%[_]CI[_]AS'       THEN 'Windows_CI_AS'
               ELSE Collation
         END AS Collation
    FROM   DefaultCollations
)
SELECT Collation, COUNT(*) AS "Count"
FROM   CTE
GROUP  BY Collation
ORDER  BY 2 DESC

This was the result I got:

Windows_CI_AS, 147; Latin1_General_CI_AI, 39; Not available at server level, 22; Windows_90_CI_AS, 3; Deprecated, not available at server level, 2; SQL_Latin1_General_CP1_CI_AS, 1

Note: as a result of the research for this article, I have submitted some Docs updates for this page, so if you run the query, the result could be somewhat different if my changes have been accepted.

If we disregard the Not available at server level for the moment, we can see that the default collation is always a Windows collation, except in one single case, but that is certainly not an insignificant one. As I mentioned in an earlier chapter, the default collation for system locale English (United States) is SQL_Latin1_General_CP1_CI_AS. There are a few servers out there with that locale. And not only in the United States, I suspect.
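
If you want to see that entry for yourself, a simple query against the table we loaded above does it:

SELECT WindowsLocale, Collation
FROM   DefaultCollations
WHERE  Collation LIKE 'SQL%'     -- should return only English (United States)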

You may note that in no case is the default collation a version-100 collation, even less a UTF-8 collation. But hang on, we will come back to that.

In the vast majority of the cases, the default collation is a case-insensitive, accent-sensitive version-80 collation, and in about one-third of the cases, the collation is Latin1_General_CI_AS, which is why I have singled out that collation in the query.

To look at the cases where the default collation is not a Latin1_General collation, I ran this query (the filter on collationproperty serves to filter out entries like Not available at server level):

SELECT * 
FROM   DefaultCollations
WHERE  collationproperty(Collation, 'Name') IS NOT NULL
  AND  Collation NOT LIKE '%Latin1_General%'

Most of the choices look reasonable to me, although I need to add the disclaimer that I don't have the knowledge to vouch for each and every one of them. There is one case where the listed default collation does not seem the best to me and that is Tatar. The listed collation is Cyrillic_General_CI_AS, but there is both Tatar_90_CI_AS and Tatar_100_CI_AS. I note that in this case the SQL LCID is listed as 0x0444, which is the LCID for Tatar. It is possible that this is an error in the listing. (See more about this below.)

Another interesting observation is that there are three Windows locales for which the default is a version-90 collation. They are Kazakh and Uzbek, which do not have any version-80 collation at all, and Macedonian, for which there is a deprecated version-80 collation.

Next, we turn to the system locales for which the default collation is given as Latin1_General_CI_AS. Many of these are correct. That is, they are for languages whose rules coincide with those for English. There are also a couple for which I don't have knowledge enough to say aye or nay. But I see a few for which Latin1_General_CI_AS does not really fit.

Then there are 39 locales for which the default collation is Latin1_General_CI_AI. So, these are all languages that use the English alphabet, but for some reason need accent-insensitivity? Not really. When I looked at that list, I immediately noticed languages that don't use Latin script at all, such as Bashkir, Japanese and Persian to name a few. I also noted languages that use Latin script and which I know have alphabets that are different from the English alphabet: Norwegian, Sami, Serbian and Bosnian are some examples. The picture became clear when I ran a query that uses the SQL LCID column to see if that value maps to a suitable collation. Since that column is in hex, and a leading zero is sometimes missing, I added a computed column to the DefaultCollations table:

ALTER TABLE DefaultCollations ADD
   SQLLCIDint AS convert(int, convert(varbinary, 
     IIF(len(SQLLCID) = 7, stuff(SQLLCID, 3, 0, '0'), SQLLCID), 1))
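
As a quick sanity check of what that expression does, here is the conversion spelled out for the Tatar LCID 0x0444 mentioned above (a sketch of my own):

SELECT convert(int, convert(varbinary, '0x0444', 1)) AS lcid   -- returns 1092, that is, LCID 0x0444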

Then I ran this query:

SELECT C.*, fh.name
FROM   DefaultCollations C
LEFT   JOIN sys.fn_helpcollations() fh ON 
           convert(int, collationproperty(fh.name, 'LCID')) = C.SQLLCIDint
        AND fh.name LIKE '%[_]CI[_]AS'
WHERE  C.Collation = 'Latin1_General_CI_AI'

In all cases but one, the SQL LCID points to a version-100 collation. The exception is Inuktitut (Syllabics) Canada, for which there is no collation at all for that LCID. It seems like Microsoft does not like using version-100 collations as defaults!

Some of these choices seemed so outright silly, that I said to myself: It just can't be that bad and went on to do some testing to verify. On a VM I had around, I changed the system locale to the one I wanted to test, rebooted and ran the SQL Server Setup until I came to the Collation page. Here are the ones I tested:

For the remaining ones that I did not test, I would expect the default to be Latin1_General_CI_AI in most cases after all. To wit, I sense a pattern: they are languages for which there are version-100 collations, but no version-80 or version-90 collations. So why is Microsoft not using the version-100 collations as the default? The answer can be found if you scroll up past that table, and read the paragraph starting with During SQL Server setup, the default installation collation setting is determined by the operating system (OS) locale. You will find this passage: For backward compatibility reasons, the default collation is set to the oldest available version that's associated with each specific locale. Therefore, this isn't always the recommended collation.

Pay attention to the last sentence: Therefore, this isn't always the recommended collation. That is one more argument for why you should check the Collation tab when running Setup. Just because something is the default, does not mean that it is the best choice.

Let's now turn to the group Not available at server level. When I look at that list, I see a common pattern: all are locales that map to a collation family for which there was no varchar support before the introduction of UTF‑8 collations. (However, the reverse does not apply. For some of these locales, for instance Maltese and Māori, Latin1_General_CI_AI is listed as the default collation.)

Using the procedure I described above, I tested with a couple of these system locales, and here is the outcome:

As you can see, some of these actually have a UTF‑8 collation as the default collation. But not all. What is the pattern? The same as before. For Bengali/Bangla and Khmer there are only version-100 collations, and they are not eligible it seems.

Observe a very important thing: all my testing was done on SQL 2022. As it happens, I know that things were different in SQL 2019. In SQL 2019 the default for Hindi (India) was – Indic_General_90_CI_AS, which is not permitted as a server collation! (Since it does not support varchar). I became aware of this when I helped a few users on SQL forums who tried to install SQL Server with a Hindi locale, and Setup failed exactly for this reason. Thus, if your system locale is Hindi, Marathi etc, it is even more important that you check what is hiding on that Collation tab if you are installing an older version of SQL Server!

We have one group left: the two entries with Deprecated, not available at server level. OK, if these Windows locales are deprecated, it makes sense that there is no default. But, wait, why would both Azerbaijani locales be deprecated, with no other choice, I said to myself. It was time for a new test. And I found two default collations very apt for the Windows locales in question: Azeri_Cyrillic_100_CI_AS and Azeri_Latin_100_CI_AS. So finally, I found two locales with a version-100 collation as the default! I think I understand what has been going on here. There are also Azeri_Cyrillic_90 and Azeri_Latin_90, but both are deprecated and not returned by sys.fn_helpcollations.

There are two more entries in the table that look suspect to me, but which I have not tested, and that is Armenian and Georgian with the Windows LCID 0x0437. The default collation in both cases is Latin1_General_CI_AS, but the SQL LCID column says 0x0419, which suggests that the actual default collation is Cyrillic_General_CI_AS. These two languages use their own scripts, so there are no direct sorting conflicts with either English or Russian. Georgian has its own collations, and there is another Georgian LCID which defaults to a Georgian collation. There are no collations with Armenian in the name.
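
You can verify the last two claims with sys.fn_helpcollations, which we used in the query above:

SELECT name FROM sys.fn_helpcollations() WHERE name LIKE 'Georgian%'   -- returns a number of collations
SELECT name FROM sys.fn_helpcollations() WHERE name LIKE 'Armenian%'   -- returns nothing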

...and then there are quite a few languages whose Windows locale is not present in the table at all. Presumably, the default in this case is Latin1_General_something. Which may or may not fit.

This section has given you an exposé of what you can run into if you ignore that Collation tab and just click Next> when you are done with the information on the front tab. That is, you may get something you like and want. Or you may get something which is not at all suitable for the data you intend to store. And in some cases, the default collation may even cause the SQL Server installation to fail. Finally, even if you look in that table and think you like the default collation listed, you should still check that tab. As I have pointed out, there is more than one incorrect entry in that table.

In this section, we have mainly focused on whether the default collation matches your language at all. To this comes the choice of version and all the other collation properties. We will come back to this later in the section Which Collation Should I Choose? in the final chapter.

Linked Servers and Collations

When you set up a linked server, there are three collation-related options you can configure: use remote collation, collation name and collation compatible. They are documented in the topic for sp_serveroption, but you cannot say that the text is crystal clear. This section is an attempt to explain what effect these options have. I use the word "attempt" quite deliberately, because I am not sure that I have fully understood everything myself. Particularly, my experience with linked servers for other data sources than other SQL Server instances is very limited.

This section includes some demos. They can possibly be a little bit difficult to follow, but you will have to bear with me. The result of the demos depends on the server collation, and I think starting a demo with saying "first install an instance of SQL Server with this collation" is not going to fly well. However, if possible, run this demo on an instance which does not have a UTF‑8 collation. With a UTF‑8 collation, some of the queries in the demo will fail as discussed below.

For the demos in this section, we need a database and a table.

CREATE DATABASE Greek_CI_AS COLLATE Greek_CI_AS
go
USE Greek_CI_AS
go
DROP TABLE IF EXISTS MyTable
CREATE TABLE MyTable
    (id       int                                        NOT NULL PRIMARY KEY,
     varname  varchar(128)  COLLATE Latin1_General_CI_AI NOT NULL UNIQUE,
     nvarname nvarchar(128) COLLATE Latin1_General_CI_AI NOT NULL UNIQUE)

INSERT MyTable (id, varname, nvarname)
   SELECT object_id, name, name
   FROM   sys.all_objects
   WHERE  schema_id = 4
   UNION ALL
   SELECT 0, N'résumé', N'résumé'

What's important in this setup is not that the database is Greek. What is of some importance is that the database collation has a different code page from the server collation. So if you want to play this demo and you have a Greek server collation, you may want to use, say, Arabic_CI_AS instead as the database collation.

When it comes to the table columns, the idea for the demo is that they have a collation with a different accent-sensitivity than the server collation. I'm using an accent-insensitive column collation, since I believe that AS collations are more commonly used. If you happen to have an accent-insensitive server collation, change the column collations to Latin1_General_CI_AS, if you run the demos yourself. This means that the queries in this section will produce the opposite results to what I show in the text, but it should still be consistent with the observations I'm making.

To keep the text simpler, for the rest of this section I am assuming that the server collation is an AS collation, and if you have an AI collation, you will have to hang on as well as you can.

Just to check that the setup is correct, you can run these queries:

SELECT * FROM MyTable WHERE varname  =  'resume'
SELECT * FROM MyTable WHERE nvarname = N'resume'

Both should return a row, since the collation is accent-insensitive.

Next, we need a linked server. For simplicity, we're setting this up as a loopback server:

IF EXISTS (SELECT * FROM sys.servers WHERE name = 'LOOPBACK')
   EXEC sp_dropserver LOOPBACK
EXEC sp_addlinkedserver 'LOOPBACK', '', 'MSOLEDBSQL', @datasrc = @@servername

SELECT uses_remote_collation, collation_name, is_collation_compatible 
FROM   sys.servers 
WHERE  name = N'LOOPBACK'

EXEC sp_serveroption LOOPBACK, 'use remote collation', 'true'
EXEC sp_serveroption LOOPBACK, 'collation name', NULL
EXEC sp_serveroption LOOPBACK, 'collation compatible', 'false'

That is, this linked server goes back to the instance itself. SQL Server is not smart enough to see through this, so for all practical purposes, this is the same as a linked server to a remote instance.

The script also explicitly sets the three collation-related options, but these settings are the default settings, as you can tell by reviewing the output from the query against sys.servers.

Now, we can try running these queries, with Include Actual Execution Plan enabled:

SELECT * FROM LOOPBACK.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'

Both queries return the row with résumé. The execution plan is the same for both:

Query plan for the queries above.

The fact that the Remote Query operator returns a single row, despite the table having many rows, tells us that the WHERE condition was remoted, as the term goes. That is, the condition was evaluated on the remote side. This is often desirable for performance reasons.

Another query to test is this quite nonsensical one:

SELECT COUNT(*) 
FROM   MyTable M
JOIN   LOOPBACK.Greek_CI_AS.dbo.MyTable R ON R.nvarname = M.nvarname

The query completes without error and returns the number of rows in MyTable.

Let's now test what happens if we flip the setting of use remote collation:

EXEC sp_serveroption LOOPBACK, 'use remote collation', 'false'

We run that nonsensical query again:

SELECT COUNT(*) 
FROM   MyTable M
JOIN   LOOPBACK.Greek_CI_AS.dbo.MyTable R ON R.nvarname = M.nvarname

This time there is an error message:

Msg 468, Level 16, State 9, Line 45

Cannot resolve the collation conflict between "Latin1_General_CI_AI" and "Finnish_Swedish_100_CS_AS_SC" in the equal to operation.

The name of the second collation in the message will be your server collation.

That is, the meaning of use remote collation is that when true, SQL Server should use the collation of the columns in the remote table. When false, on the other hand, SQL Server handles the remote data as if its collation is the server collation of the local instance.

Let's now run these lookup queries again:

SELECT * FROM LOOPBACK.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'

Both queries complete successfully (as long as you don't have a UTF-8 collation at server level). The second query still returns the row with résumé, whereas the first does not. Furthermore, the query plan for the second query is the same as before. That is, the WHERE clause is evaluated on the remote side.

However, the query plan for the query with varname is different:

Query plan for the query with varname

The Remote Query operator returns all rows, and the WHERE clause is now implemented with a Filter operator on the local side, using the server collation. Since this is an accent-sensitive collation, resume does not match résumé, and thus the query returns no rows. This appears quite logical: We have told SQL Server to ignore the remote collation and use the local server collation instead. However, if this was the rationale, nvarname should be treated in the same way, and it is not.

I am not sure, but my assumption is that this is related to code pages. With use remote collation set to false, we want the condition to be evaluated according to the code page of the local collation rather than that of the remote collation. Never mind that in this example the code page is the same on both sides; the mere possibility of a difference seems to compel SQL Server to evaluate the condition locally. With nvarchar there are never any issues with code pages, which would explain why remoting still occurs here.

If your server collation is a UTF-8 collation, both SELECT queries fail with this error:

Msg 7339, Level 16, State 1, Line 59

OLE DB provider 'MSOLEDBSQL' for linked server 'LOOPBACK' returned invalid data for column '[LOOPBACK].[Greek_CI_AS].[dbo].[MyTable].varname'.

Apparently, the remote data is not converted to the local code page, but the raw bytes are received as-is. In UTF-8, bytes with the high bit set always come in pairs, triplets or quadruplets (see the discussion in the UTF-8 section), so single high-bit bytes like the two és in résumé are illegal as UTF-8.
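
If you want to see the byte-level difference for yourself, here is a small sketch of my own (it requires SQL 2019 or later, since it uses a UTF-8 collation):

SELECT convert(varbinary(10), convert(varchar(10),
           N'é' COLLATE Latin1_General_100_CI_AS))                AS cp1252_bytes,  -- 0xE9, a single high-bit byte
       convert(varbinary(10), convert(varchar(10),
           N'é' COLLATE Latin1_General_100_CI_AS_KS_WS_SC_UTF8))  AS utf8_bytes     -- 0xC3A9, a two-byte sequence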

You may ask what the use case is for setting use remote collation to false. The use case would be to avoid collation conflicts with other systems. Given the fact that we got a collation error in the demo when we changed the setting to false, this may seem contradictory. But in the demo, we are purposely using a database collation which is different from the server collation. In many cases in real life, they are the same. On the other hand, in real-life situations, you rarely set up loopback servers; instead you set up linked servers to other SQL Server instances, and they may be using different collations than your local SQL Server instance. As long as your database and server collations are the same, you can resolve this situation by setting use remote collation to false. But note that this comes at the price that remoting is prevented, so performance suffers. So it may be better to stay with the default of true and cast the collation where it is actually needed. In some cases, the best may be to cast a local column to the remote collation to facilitate remoting.
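
As a sketch of that last point, using the objects from the demo: with the cast on the local column, the remote column is left untouched, so the comparison can still be evaluated on the remote side. (Here the collation name happens to be the one the remote column already has; in a real scenario you would use whatever collation the remote column actually has.)

SELECT COUNT(*)
FROM   MyTable M
JOIN   LOOPBACK.Greek_CI_AS.dbo.MyTable R
          ON R.nvarname = M.nvarname COLLATE Latin1_General_CI_AI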

There is one more aspect, though. That remote database may actually use the same collation as your database. But that database is not under your control, and someone could decide to change the collation on the remote database. Given how difficult that is, the risk is not that high. But, nevertheless, if it happens, things break on your side and you have an outage. Yes, once you have understood what is going on, you can flip the setting for use remote collation. But now you may get performance issues, because remoting no longer happens, and sorting out those issues is likely to be non-trivial.

That is, setting use remote collation to false can be a defensive measure in advance. After all, there are many other things that prevent remoting. For this reason, I tend to shy away from referring to remote tables in plain four-part form, but instead I use various tricks to enforce execution on the remote side when I think this is desirable. (For some examples of such tricks, see the section Accessing Remote Data Sources in my article The Curse and Blessings of Dynamic SQL.)

When it comes to the other two options, collation name and collation compatible, I have not been able to detect any effect when playing with these options, as long as the linked server is an SQL Server instance set up in the normal way. I guess that this is because SQL Server knows that the other instance is also SQL Server, and in that case, it does not need any guidance from these server options.

It's a different matter if the linked server is something else than SQL Server. Now, this is a little difficult to demo, since I can't assume that you have MySQL, Postgres, Oracle or whatever around. So we will do this demo with a remote data source that is our local SQL Server instance, but accessed over ODBC with the MSDASQL provider like this:

IF EXISTS (SELECT * FROM sys.servers WHERE name = 'LOOPBACK_ODBC')
   EXEC sp_dropserver LOOPBACK_ODBC
DECLARE @connstr nvarchar(200) = 
        N'Driver={ODBC Driver 17 for SQL Server};' + 
        N'Server=' + @@servername + ';' +
        N'Trusted_connection=yes'
EXEC sp_addlinkedserver 'LOOPBACK_ODBC', '', 'MSDASQL', @provstr = @connstr

SELECT uses_remote_collation, collation_name, is_collation_compatible 
FROM   sys.servers 
WHERE  name = N'LOOPBACK_ODBC'

EXEC sp_serveroption LOOPBACK_ODBC, 'use remote collation', 'true'
EXEC sp_serveroption LOOPBACK_ODBC, 'collation name', NULL
EXEC sp_serveroption LOOPBACK_ODBC, 'collation compatible', 'false'

Important note: While this is a neat trick for testing, you should not use MSDASQL + ODBC for linked servers to SQL Server in production, but you should use the MSOLEDBSQL provider or SQLNCLI11 on older versions.

Since we are using MSDASQL rather than one of OLE DB providers that are dedicated for SQL Server, SQL Server does not realise that it is talking to itself, but considers this to be a "general ODBC data source".

Note: Depending on what ODBC drivers you have installed, you may need to change the driver in the connection string. To see what drivers you have installed, see the ODBC Data Sources tool in Windows.

Again, I am setting the three collation options explicitly, but the output from the query against sys.servers tells us that these are the default settings.

Let's run the lookup queries again, with Include Actual Execution Plan enabled:

SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'

The result is the same as above with use remote collation set to false. That is, the first query does not return any row, and the query plan includes a local Filter operator. The second query returns a row, and the Remote Query operator returns a single row, telling us that the WHERE condition was remoted. When it comes to the nonsensical join query:

SELECT COUNT(*) 
FROM   MyTable M
JOIN   LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable R ON R.nvarname = M.nvarname

We get the error message about collation conflict between Latin1_General_CI_AI and the server collation.

Thus, it seems that despite the setting, SQL Server is handling this linked server as if use remote collation is set to false. But it is not really so. The difference is that SQL Server has no idea at all about the collation on the remote side. It thinks "this linked server is something else than SQL Server, and other products have different names for their collations that I don't understand."

This is where the setting collation name comes into play. We need to tell SQL Server what the collation is like on the other side:

EXEC sp_serveroption LOOPBACK_ODBC, 'collation name', 'Latin1_General_CI_AI'

We run these queries again:

SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'
go
SELECT COUNT(*) 
FROM   MyTable M
JOIN   LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable R ON R.nvarname = M.nvarname

The first two queries both return the row with résumé, and both have query plans where the condition is remoted. The nonsensical join query now completes without a collation error.

This is a setting that requires some care. If you are querying many different tables on the remote data source, and these tables have different collations, you may get funky results, particularly if multiple code pages are in play. Here is a silly example:

EXEC sp_serveroption LOOPBACK_ODBC, 'collation name', 'Ukrainian_CI_AS'
go
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'

SQL Server gladly remotes the WHERE condition for both queries, and despite the fact that we said that the collation was accent-sensitive, we get back a row. But it looks funny:

Output from one of the queries above. The accented e has been replaced by a cyrillic character

The é has been replaced by a Cyrillic character, which presumably sits at the same code point in CP1251, the Cyrillic code page, as é does in CP1252. Well, we asked for it.

If you run the nonsensical query, you will find that it dies with a collation conflict between Latin1_General_CI_AI and Ukrainian_CI_AS.

It's not that the setting use remote collation does not matter. If you run:

EXEC sp_serveroption LOOPBACK_ODBC, 'use remote collation', 'false'

and then run the three queries again, you will find that the result is the same as before we set collation name. That is, this setting is only considered when use remote collation is true.

Left to explore is collation compatible:

EXEC sp_serveroption LOOPBACK_ODBC, 'collation compatible', 'true'
go
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE  varname =  'resume'
SELECT * FROM LOOPBACK_ODBC.Greek_CI_AS.dbo.MyTable WHERE nvarname = N'resume'

Both queries now return the row with résumé and in both cases the WHERE clause is remoted. On the other hand, the nonsensical join query still dies with a collation conflict between the Latin1_General_CI_AI and the server collation.

So it seems that "collation compatible" means "code-page compatible". That is, we are telling SQL Server that it can rely on non-Unicode data on the remote instance having the same code page as on the local instance, and therefore it can remote conditions on varchar columns as well.

While this was intended to show what happens when you have a linked server that is not SQL Server, it is not the full story. I also ran a test against a MySQL database, using the MSDASQL provider in combination with the Unicode ODBC driver for MySQL. The results were not the same as for LOOPBACK_ODBC. Specifically, no matter the setting of these options, SQL Server never remoted the condition for the WHERE clause to MySQL. This had the effect that when I set collation name, the WHERE clause was evaluated according to the collation I had specified and not the collation on the MySQL side. Which you could argue is a semantically more correct behaviour. But I can't say that I am comfortable with the result of a query depending on whether the optimizer decides to remote a condition or not.

Why remoting did not happen in the case of MySQL, I don't know, but compilation of distributed queries is a complex interplay where SQL Server interrogates the OLE DB provider about its capabilities. And in the case of MSDASQL, it has to pass the buck to the ODBC driver. If there is any sort of doubt, SQL Server needs to play it safe and decline to remote. Thus, the reason that remoting did not occur in my MySQL test could be something else than collations.

While this section hopefully has given you some idea of what purpose these options serve, I doubt that you have gotten full clarity, least of all in the case where the remote data source is something other than SQL Server. If you want to know what happens with character data with your remote data source, you need to do some experimentation on your own.

Finally, you may still wonder what the Greek database collation had to do with it. Not a whole lot, but the idea was to illustrate that in these situations it is the server collation that matters, not the database collation.

A Trick with Fallback Characters

Something I see every once in a while on forums is that people want to remove "special characters". After asking what they really mean, it turns out they want to retain the characters that are the most special of them all, those that are the primus inter pares. That is, they only want to keep characters in the ASCII set. (And often only letters and digits.) It's the plainer characters outside that set, like Å, Ø, Ħ to name a few, that they want to get rid of. Personally, I hate this. I live on Rörstrandsgatan, and if you try to tell me that it can be spelled without that second letter, I say, sure, just like the capital of the USA is Vashington. (I have told you that in Swedish V and W are equivalent, haven't I?) But I digress.

One way to do this is to use the translate function (introduced in SQL 2017). Here is an example of how to reduce an ASCII text to only use the letters of Classical Latin, removing the later additions J, U and W:

DECLARE @string nvarchar(50) = 'The quick brown fox jumps over the lazy dog'
SELECT translate(@string, 'juw', 'ivv')

The second and third arguments to translate should be strings of equal length. Each occurrence of the first character in the second argument is replaced by the first character in the third argument, and so on.

This can be a bit tedious to set up, particularly if you don't know exactly what characters you may encounter. But say that you mainly expect letters in the Latin script that you want to replace with their ASCII lookalikes. Here is some sample data. (This script is also available as fallbacktrick.sql in the collationscripts.zip file.)

USE Latin1_General_CI_AS
go
DROP TABLE IF EXISTS words
CREATE TABLE words (word     nvarchar(20) NOT NULL,
                    original nvarchar(20) NOT NULL,
                    CONSTRAINT pk_words PRIMARY KEY (word))
go
INSERT words (word, original)
  SELECT word, word
  FROM   (VALUES(N'Ålborg'),     (N'Nykøbing F'), (N'Łódź'), 
                (N'räksmörgås'), (N'dziękuję'),   (N'miłość'), 
                (N'återvända'),  (N'Hellebæk'),   (N'Tōkyō'), 
                (N'życie'),      (N'Zürich'),     (N'noël')) AS V(word)

go
SELECT word FROM words

The data is a mix of place names and other words from Danish, French, German, Japanese, Polish and Swedish. Here is an image of the words:

The words inserted in the script above.

It may at first not seem apparent how collations could help us here. The trick is to first convert the data to varchar. In this process, SQL Server will convert any character not available in the code page to a fallback character, which in many cases is an ASCII lookalike. So we can run:

SELECT convert(varchar(30), word), word FROM words

Which gives us this:

Result of the SELECT above. Some non-ASCII characters have been replaced, but not all.

Most of the non-ASCII letters in the Polish and Japanese words have been replaced, but letters that are available in Latin‑1 are still there.

What we can do now is to cast to a collation with a different code page, preferably one for a language that does not use the Latin script. Greek is one option. So let's try:

UPDATE words
SET    word = convert(varchar(30), word COLLATE Greek_CI_AS)
SELECT word, original FROM words

We get this:

Output after the UPDATE. It's all ASCII now.

We can claim success, almost. The Danish letter æ was replaced with a question mark, the general fallback character. The preferred result may be ae, but I believe that fallback characters are always one-to-one, so one character is never replaced by two. (Which kind of makes sense, as that could cause issues with string truncation.)

This is certainly a brutal method, and you may get replacements you don't fancy. You could try some other collation and see if this works better. You can even make multiple collation casts. For instance, I find that when I first cast to Polish_CI_AS, Hellebæk becomes Hellebak. And you can of course also use translate for the conversions that are important to you. Although, translate also replaces characters on a one-by-one basis. To change æ into ae, you would need to use the replace function.
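
Here is a minimal sketch of that idea, using the Hellebæk example from above. The exact replacements you need will of course depend on your data:

DECLARE @word nvarchar(30) = N'Hellebæk'
-- First replace æ with ae, then let the cast to a Greek collation
-- take care of the remaining non-ASCII letters.
SELECT convert(varchar(30), replace(@word, N'æ', N'ae') COLLATE Greek_CI_AS) AS asciified
-- Should return Hellebaek.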

Multi-Database Scripts

Normally, when I write a script that is doing work in more than one database, I put the USE statement to switch databases in a batch of its own, like this:

-- Last statement for ThisDatabase here.
go
USE NextDatabase
go
-- First statement in NextDatabase here.

Partly, this is out of habit. I was taught to do it this way when I first started to use SQL Server, and, as I recall, it was a requirement then, in SQL Server 4.x. But it also brings a feeling of safety, since it is not apparent what the binding rules might be if you change databases half-way through the script. Or what the collation rules might be if you move between different databases with different collations.

However, it is permitted to have USE right in the middle of a batch to switch databases, and sometimes you may want to do this. The chief reason might be that you have values in scalar variables that you want to use in both databases, and you feel that the workaround of storing the values in temp tables or session context is too bulky. Therefore, in this section we will look at what happens when the databases have different collations.

First, a recap of what we have learnt previously: string literals and variables get their collation from the current database, whereas columns keep the collation they were given when they were created.

Will we have to modify any of these rules for multi-database batches?

For this exercise, we will work with two databases that we have worked with before. But in case you dropped any of them, or never created them, I give the CREATE statements for both:

CREATE DATABASE Japanese_CI_AS COLLATE Japanese_CI_AS
go
CREATE DATABASE Czech_100_CS_AS_SC COLLATE Czech_100_CS_AS_SC

Note that these two collations use different code pages for varchar, and that one of them is case-insensitive, whereas the other considers upper- and lowercase to be different.
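
If you want to verify the code pages yourself, you can query collationproperty; it should report 932 (Shift-JIS) for the Japanese collation and 1250 (Central European) for the Czech one:

SELECT collationproperty('Japanese_CI_AS', 'CodePage')     AS Japanese_codepage,
       collationproperty('Czech_100_CS_AS_SC', 'CodePage') AS Czech_codepage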

We will play with the script below, which you also find as Multi-Database.sql in the collationscripts.zip archive.

USE Japanese_CI_AS
go
DECLARE @UPPER      nvarchar(20) = N'NAKAMEGURO',
        @lower      nvarchar(20) = N'nakameguro',
        @Nakameguro varchar(20)  = N'中目黒',
        @boy        nvarchar(20) = N'chlapec',
        @girl       nvarchar(20) = N'dĕvĉátko'

SELECT @Nakameguro AS "Nakameguro",
       IIF(@UPPER = @lower, N'Equal', N'Not Equal') AS CaseSensitivity,
       IIF(@boy < @girl, N'Boys first', N'Girls first') AS Alphabet,
       IIF(N'chlapec' < N'dĕvĉátko', 
                          N'Boys first', N'Girls first') AS Alphabet3

USE Czech_100_CS_AS_SC

DECLARE @ČeskéBudějovice varchar(20)  = N'České Budějovice',
        @Nakameguro2     varchar(20)  = N'中目黒',
        @CAPS            nvarchar(20) = N'PRAHA',
        @minis           nvarchar(20) = N'praha',
        @boy2            nvarchar(20) = N'chlapec',
        @girl2           nvarchar(20) = N'dĕvĉátko'

SELECT @Nakameguro AS Nakameguro, @Nakameguro2 AS Nakameguro2, 
       @ČeskéBudějovice AS ČeskéBudějovice

SELECT IIF(@UPPER = @lower, N'Equal', N'Not Equal') AS CaseSensitivity,
       IIF(@CAPS  = @minis, N'Equal', N'Not Equal') AS CaseSensitivity2,
       IIF(@boy < @girl,   N'Boys first', N'Girls first') AS Alphabet,
       IIF(@boy2 < @girl2, N'Boys first', N'Girls first') AS Alphabet2,
       IIF(N'chlapec' < N'dĕvĉátko', 
                           N'Boys first', N'Girls first') AS Alphabet3

Screenshot of the script above.

Before we look at the output, just an explanation of the contents. You may recognise the Japanese characters by now: they read Nakameguro, a district in Tokyo. Chlapec is the Czech word for boy, while dĕvĉátko is Czech for girl. Praha is the Czech name for Prague. České Budějovice is a city in the south of Czechia. You should know that in Czech, ch is considered to be a letter of its own and it sorts between h and i (sic!) in the Czech alphabet.

Also, pay attention to the variables for Nakameguro and České Budějovice: they are varchar and thus dependent on the code page of the collation. Remaining variables are nvarchar.

Here is the output:

Output from the script above.

The first result set is not very startling. We see the Japanese characters correctly, as expected since we are in a Japanese database. Uppercase and lowercase are equal, again expected since we are in a CI database. And C sorts before D, like it normally does.

In the second result set, we have three place-name variables: one declared in the Japanese database and two declared in the Czech database. But no matter where the variable is declared, the Japanese characters are presented correctly. On the other hand, in the Czech name, the three non-ASCII letters are replaced by fallback characters. So it seems that the collation of all three variables is Japanese_CI_AS.

The third result set confirms this conclusion. Both CaseSensitivity columns return Equal, despite the latter pair being declared in the Czech part of the script. And the same applies to Alphabet and Alphabet2 which both reflect the Japanese (or general Latin) order. On the other hand, Alphabet3, with string literals, now reflects the Czech alphabet.

When we submitted the script, the current database was Japanese_CI_AS, so a possible conclusion is that the variables assume the collation of the database the script originally was compiled in, whereas string literals are evaluated at run-time. However, the assumption about variables is not correct. Change the top of the script to read:

USE Czech_100_CS_AS_SC
go
USE Japanese_CI_AS
DECLARE @UPPER      nvarchar(20) = N'NAKAMEGURO',
        @lower      nvarchar(20) = N'nakameguro',

That is, the script is now submitted with Czech_100_CS_AS_SC as the current database. If our assumption is correct, the output will change. However, it remains exactly the same.

Hm, maybe there is a special rule if the script starts with a USE statement? No, insert some statements before the USE in the multi-database batch, for instance:

USE Czech_100_CS_AS_SC
go
DROP TABLE IF EXISTS #temp
CREATE TABLE #temp(a int NOT NULL)

SELECT COUNT(*) FROM sys.objects

SELECT IIF(N'chlapec' < N'dĕvĉátko', N'Boys first', N'Girls first') AS Alphabet3

USE Japanese_CI_AS
DECLARE @UPPER      nvarchar(20) = N'NAKAMEGURO',

There are two more result sets (with expected output), but the original three are the same as before.

But throw in a variable declaration before the USE:

USE Czech_100_CS_AS_SC
go
DECLARE @a int 
USE Japanese_CI_AS
DECLARE @UPPER      nvarchar(20) = N'NAKAMEGURO',

Now the output changes:

Altered output where the Czech collation wins.

The Japanese characters are now replaced by fallback characters, whereas České Budějovice is presented correctly. Comparisons are now case-sensitive, and all Alphabet columns with variables respect the Czech alphabet. So it seems that the rule is that all variables take the collation of the database where the first variable is declared.

Although, this needs some qualification. Here is something really crazy:

USE Czech_100_CS_AS_SC
go
IF 1 = 0
   USE Japanese_CI_AS
DECLARE @UPPER      nvarchar(20) = N'NAKAMEGURO',

In this case, the script executes entirely in the Czech database, but if you run this, you can see that the variables follow the Japanese CI collation. I would guess that SQL Server ignores the IF statement when parsing the script and considers Japanese_CI_AS to be the first database. (If you really write scripts where you conditionally change databases, don't blame me if you end up in a state of total confusion!)

Given these rules, we could have a situation where we combine a variable and a literal with different collations. For instance:

USE Japanese_CI_AS
go
DECLARE @var  nvarchar(20) = 'value'
USE Czech_100_CS_AS_SC
IF @var = 'whatever'
   PRINT 1

What happens now? Answer: it ends in tears. That is, there is a collation conflict:

Msg 468, Level 16, State 9, Line 5

Cannot resolve the collation conflict between "Czech_100_CS_AS_SC" and "Japanese_CI_AS" in the equal to operation.
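
If you really need to mix like this, a sketch of one way out is to add an explicit COLLATE clause, since an explicit collation takes precedence over whatever collation labels the two operands have. Here I pick the collation of the variable, but either collation would resolve the conflict:

USE Japanese_CI_AS
go
DECLARE @var  nvarchar(20) = 'value'
USE Czech_100_CS_AS_SC
-- The explicit COLLATE clause overrides the collations of both operands,
-- so there is no longer a conflict.
IF @var = 'whatever' COLLATE Japanese_CI_AS
   PRINT 1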

Finally, let's look at table variables. Below is a script. Note that this time we start in the Czech database.

USE Czech_100_CS_AS_SC
go
DECLARE @T1 TABLE (str nvarchar(20) PRIMARY KEY)

INSERT @T1(str) VALUES (N'Praha'), (N'PRAHA')

USE Japanese_CI_AS

DECLARE @T2 TABLE (str nvarchar(20) PRIMARY KEY)

INSERT @T1(str) VALUES (N'Brno'), (N'BRNO')
INSERT @T2(str) VALUES (N'Osaka'), (N'OSAKA') 

From what we saw earlier, it seems reasonable to assume that the columns of both @T1 and @T2 would assume the Czech collation, which is case-sensitive, and thus all inserts would succeed. However, this is the output:

(2 rows affected)

 

(2 rows affected)

 

Msg 2627, Level 14, State 1, Line 13

Violation of PRIMARY KEY constraint 'PK__#B59FDEB__DDDF32887B03D380'. Cannot insert duplicate key in object 'dbo.@T2'. The duplicate key value is (OSAKA).

The statement has been terminated.

So the default collation for the columns in a table variable seems to follow the collation of the database where the table variable is declared.
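
If you find yourself in this situation, a safer arrangement, sketched here, is to state the collation explicitly in the table-variable declaration, so that the column collation does not depend on which database happens to be current at that point:

-- The explicit COLLATE clause makes the column collation independent
-- of the current database.
DECLARE @T2 TABLE (str nvarchar(20) COLLATE Czech_100_CS_AS_SC PRIMARY KEY)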

I need to close this section with some caveats. I have not researched if there is any documentation that describes this behaviour. Nor can I say whether what you have seen here is due to a careful design or just to happenstance. Furthermore, I cannot exclude that there is some quirk that I have missed that changes our conclusions about the rules. If I am to give one recommendation it is: don't run multi-database batches across databases with different collations unless there is a very good reason to.

How the Collation Affects Performance

As the title says, in this chapter we will study the effect of collations on performance. The first two sections are both essential reading, since they describe two situations where the impact from the collation is highly significant. The rest of the chapter is devoted to data from my performance tests of all the official collations in SQL 2022. While there are some interesting observations, not the least regarding the UTF‑8 collations, it is not essential reading in the same manner as the first two sections, and if you get bored by all the numbers that I show, feel free to move on to the last section in this chapter, which has a summary of my findings.

The Data-Type Mismatch Accident

For this section we will use a table called lookuptest, which we create and populate in this way:

DROP TABLE IF EXISTS lookuptest
CREATE TABLE lookuptest (
    ident    bigint       NOT NULL IDENTITY,
    Windows_nvarchar nvarchar(60) 
          COLLATE Latin1_General_CI_AS NOT NULL,
    SQL_nvarchar     nvarchar(60) 
          COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    Windows_varchar  varchar(60)  
          COLLATE Latin1_General_CI_AS NOT NULL,
    SQL_varchar      varchar(60)  
          COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    some_number      int                       NOT NULL,
    filler char(200) NOT NULL CONSTRAINT default_data DEFAULT ' ',
    CONSTRAINT pk_lookuptest PRIMARY KEY (ident)
)

; WITH manywords AS (
    SELECT TOP (100000) 
           left(IIF(f.flip = 1, a.name, reverse(a.name)) + 
                IIF(f.flip = 1, b.name, reverse(b.name)), 60) AS word
    FROM        (SELECT DISTINCT name FROM sys.all_columns) a
    CROSS  JOIN (SELECT DISTINCT name FROM sys.all_columns) b
    CROSS  JOIN (VALUES(1), (2)) AS f(flip)
    ORDER  BY newid()
)
INSERT lookuptest (Windows_nvarchar, SQL_nvarchar, Windows_varchar, 
                   SQL_varchar, some_number)
   SELECT word, word, word, word, rand() * 1000000
   FROM   manywords

CREATE INDEX Windows_nvarchar_ix ON lookuptest(Windows_nvarchar)
CREATE INDEX SQL_nvarchar_ix ON lookuptest(SQL_nvarchar)
CREATE INDEX Windows_varchar_ix ON lookuptest(Windows_varchar)
CREATE INDEX SQL_varchar_ix ON lookuptest(SQL_varchar)

That is, the table has four string columns, of which two are nvarchar columns, one with a Windows collation and one with an SQL collation, and then there is an equivalent pair of varchar columns. We add an index on each of these four columns, and we add the indexes after populating the data to make them "perfect". There is an IDENTITY column, which serves as the clustered primary key. There is an integer column with a random number, and finally there is a filler column to make the rows wider. We fill this table with some random strings produced from the system catalogue. These strings may not be unique, but there should not be a lot of duplicates.

What we want to test is the performance of a lookup operation like this one:

SELECT @t = some_number FROM lookuptest WHERE col = @val

Because such a query is very quick, we cannot just run the query once. Instead we run the query in a loop for 500 ms, and count how many iterations we make:

SET NOCOUNT ON
go
DECLARE @t int,
        @i int = 0,
        @d datetime2 = sysdatetime()

WHILE datediff(ms, @d, sysdatetime()) < 500
BEGIN
   SELECT @t = some_number FROM lookuptest WHERE Windows_nvarchar = N'namename'
   SET @i += 1
END
SELECT @i AS [No of iterations]

When I ran it on my machine, I got 66 500 iterations.

I should point out that the variable assignment serves an important purpose here: We don't want to return a result set to the client, since the time it takes SSMS to render all those result sets would seriously distort the test.

Before we move on, enable display of the actual execution plan and run (we don't need the variable assignment here):

SELECT some_number FROM lookuptest WHERE Windows_nvarchar = N'namename'

This is the query plan:

Query plan for the query above

As we can expect, it is an Index Seek + Key Lookup.

Disable the display of the execution plan, replace Windows_nvarchar with SQL_nvarchar, and run the script again:

SET NOCOUNT ON
go
DECLARE @t int,
        @i int = 0,
        @d datetime2 = sysdatetime()

WHILE datediff(ms, @d, sysdatetime()) < 500
BEGIN
   SELECT @t = some_number FROM lookuptest WHERE SQL_nvarchar = N'namename'
   SET @i += 1
END
SELECT @i AS [No of iterations]

I got 67 802 iterations. Re-enable the display of the execution plan, run a single execution, and look at the plan. You will find that it is identical to the one above. Only the name of the index is different.

The result is by no means surprising. As I have noted earlier, for nvarchar, these two collations are identical.

Again, make sure that you have disabled the display of the execution plan and now run the loop for the column Windows_varchar. My result was 28 091 iterations. That is, a noticeable degradation of performance. Why does this happen? As the title of this section suggests, we have a data-type mismatch. We have a varchar value that is compared to an nvarchar value. Whenever two types meet in an expression, SQL Server will apply its data type precedence and attempt to convert the type of lower precedence to the other type. varchar has lower precedence than nvarchar, so this means that the column Windows_varchar is converted to nvarchar. As you may guess, this has an effect on the query plan. Enable display of the execution plan and run:

SELECT some_number FROM lookuptest WHERE Windows_varchar = N'namename'

This is what it looks like:

Query plan for the previous query

The plan is more complex than the previous one, but if you look a little closer, you can see that there is still a Seek on the index on the column Windows_varchar. Recall that for Windows collations, all operations are carried out in UTF‑16. For a Windows collation, varchar is an extract of the characters in nvarchar. Not merely a subset, but an ordered subset. Whence, an Index Seek is still possible. However, SQL Server feels compelled to add a few extra operators compared to the plan for nvarchar, and that's why this query is less efficient. In fact, the execution time more than doubled. While this sounds alarming, keep in mind that these lookups are very swift operations, and if this accident were to occur in a live system, it may not even be noticeable. And more importantly, even if the table were a thousand times larger or more, the performance cost for the type mismatch would be about the same, a factor of two to three.

Again, make sure display of execution plan is disabled, and then try the loop with the last column, SQL_varchar:

SET NOCOUNT ON
go
DECLARE @t int,
        @i int = 0,
        @d datetime2 = sysdatetime()

WHILE datediff(ms, @d, sysdatetime()) < 500
BEGIN
   SELECT @t = some_number FROM lookuptest WHERE SQL_varchar = N'namename'
   SET @i += 1
END
SELECT @i AS [No of iterations]

When I ran it, I got 48 iterations. Forty-eight. That is more than a hundred times slower than with an nvarchar column. What is the query plan this time?

Query plan for the query SELECT some_number FROM lookuptest WHERE SQL_varchar = N'namename'

It's radically different. There is not an Index Seek but an Index Scan. The reason is that for SQL collations, the rules for varchar are not the same as for nvarchar. You may recall that when we looked at this collation, SQL_Latin1_General_CP1_CI_AS, how punctuation characters come in different order in varchar and nvarchar, and the letter Š is not handled as a letter in varchar. Therefore, the organization of the index is of no use after the conversion to nvarchar. A Seek is not possible, only a Scan is. This also means that there is no upper limit for the penalty for the data-type mismatch. The bigger the table, the bigger the disaster.
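
The obvious remedy, assuming you are in control of the query, is to make sure that the literal or parameter has the same data type as the column. With a varchar literal there is no conversion at all, and you should be back to the plain Index Seek + Key Lookup plan:

SELECT some_number FROM lookuptest WHERE SQL_varchar = 'namename'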

You could argue that this is no cause for alarm, since the root cause here is a programming mistake. That is true. However, it is a mistake that is by no means uncommon. The DBA decides that all columns should be varchar to "save space", while the application developers decide to use an ORM that defaults to using Unicode and nvarchar. Add to this that some programmers overall are sloppy and use varchar, nvarchar and N before literals with no real understanding of what they are doing. So I would still say this is quite a strong argument for staying away from SQL collations.

Nevertheless, in the next section, the SQL collations will take revenge.

LIKE '%abc%'

In this section we will work with a table with some similarities to the table in the previous section:

DROP TABLE IF EXISTS guids
CREATE TABLE guids (
    ident            bigint  NOT NULL IDENTITY,
    Windows_nvarchar nvarchar(200) 
          COLLATE Latin1_General_CI_AS NOT NULL,
    SQL_nvarchar     nvarchar(200) 
          COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    Windows_varchar  varchar(200)  
          COLLATE Latin1_General_CI_AS NOT NULL,
    SQL_varchar      varchar(200)  
          COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
)
go
; WITH guidstrings AS (
   SELECT TOP (500000) 
          concat_ws(' ', newid(), newid(), newid(), newid()) AS str
   FROM   sys.all_columns a
   CROSS  JOIN sys.all_columns b
)
INSERT guids (Windows_nvarchar, SQL_nvarchar, Windows_varchar, SQL_varchar)
   SELECT str, str, str, str
   FROM   guidstrings

We have the same four columns, but they are longer, just below 150 characters in length. The data is random strings with hexadecimal digits, spaces and hyphens. There are no indexes on this table.

What we will test this time is a LIKE operation where the pattern has both a leading and a trailing wildcard. We run this for all four columns:

DECLARE @d         datetime2(3) = sysdatetime(),
        @searchstr varchar(50)  = '%abc%',
        @cnt       int
SELECT @cnt = COUNT(*) FROM guids WHERE Windows_nvarchar LIKE @searchstr
PRINT concat('Windows_nvarchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

SELECT @d = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE SQL_nvarchar LIKE @searchstr
PRINT concat('SQL_nvarchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

SELECT  @d  = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE Windows_varchar LIKE @searchstr
PRINT concat('Windows_varchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 
SELECT @d = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE SQL_varchar LIKE @searchstr
PRINT concat('SQL_varchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

I got this output on my machine:

Windows_nvarchar: 329 ms. Count: 11643

SQL_nvarchar: 332 ms. Count: 11643

Windows_varchar: 334 ms. Count: 11643

SQL_varchar: 51 ms. Count: 11643

This time, it is the SQL collation that tops the chart! For varchar, it's a little more than six times faster than the other columns.

You may get the idea to look at the query plan, but it is the same Table Scan for all. No, the explanation for this drastic difference lies elsewhere. For this operation, there is not only a full scan of the table. There is also a more or less full scan of each string. That is, SQL Server first has to look at the first character, and if it is not equal to a, it has to move to the next one and so on. If it finds an a, it has to see if the next is a b, and if it is not, it has to check again if it is an a. Unless the string abc is actually found, SQL Server has to look at all characters in the string but the last two. This is quite different from the lookup in the previous section, where SQL Server can stop immediately if there is a mismatch on the first character. The same is also true for a LIKE with only a trailing wildcard. And if SQL Server is a little smart, also for a LIKE with only a leading wildcard.

But for a LIKE with both a leading and trailing wildcard there is no escape from having to look at almost the entire string. This means that the cost of comparing a single character really starts to matter. And here lies the difference. For full Unicode there are large character tables that SQL Server needs to search for every character. This includes the column Windows_varchar, since for a Windows collation all operations are carried out in UTF‑16. But for an SQL collation, there are only 255 characters to consider for varchar, which means that the character tables are a lot smaller, and therefore faster. It goes without saying that the longer the strings are, the more noticeable is the difference.

Just to reinforce the point that when you have a point lookup with = or LIKE without a leading wildcard, the effect is nowhere near as drastic, or even noticeable, try this script:

DECLARE @d         datetime2(3) = sysdatetime(),
        @searchstr varchar(200) = 
              (SELECT SQL_varchar FROM guids WHERE ident = 4711),
        @cnt       int
SELECT @cnt = COUNT(*) FROM guids WHERE Windows_nvarchar = @searchstr
PRINT concat('Windows_nvarchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

SELECT @d = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE SQL_nvarchar = @searchstr
PRINT concat('SQL_nvarchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

SELECT  @d  = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE Windows_varchar = @searchstr
PRINT concat('Windows_varchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 
SELECT @d = sysdatetime()
SELECT @cnt = COUNT(*) FROM guids WHERE SQL_varchar = @searchstr
PRINT concat('SQL_varchar: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

This is the output I get:

Windows_nvarchar: 14 ms. Count: 1

SQL_nvarchar: 15 ms. Count: 1

Windows_varchar: 19 ms. Count: 1

SQL_varchar: 15 ms. Count: 1

Besides LIKE with leading and trailing wildcards, there are a few more operations that may require a full scan of the string: charindex, patindex and replace come to mind.

So a conclusion of this is that if we plan to run such operations on long strings, we should make those columns varchar with an SQL collation? Not really. First of all, we can only select varchar with a legacy code page if we are dead sure that the data in the column will stay within the 255 characters of that code page. But even if we can permit ourselves to make this choice, there is no need to set the collation for the column, since we can cast the collation at expression level in the operations where it matters. For instance:

SELECT charindex(@searchstr COLLATE SQL_Latin1_General_CP1_CI_AS, mycol)

Recall that it does not matter if you place the COLLATE clause with the variable or with the column. If you cast the collation of one sub-expression in a larger expression, this casts the collation for all sub-expressions in the larger expression. (This was something we discussed early in this article.)
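
For instance, here is a sketch of how the cast could be applied to the LIKE test above. Both collations use code page 1252, so the cast does not alter any characters, but the LIKE is now evaluated with the SQL-collation rules for varchar:

SELECT COUNT(*) AS cnt
FROM   guids
WHERE  Windows_varchar COLLATE SQL_Latin1_General_CP1_CI_AS LIKE '%abc%'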

There is another take on this, which is even faster than an SQL collation, and that does not lock you into the small character set of a legacy code page, to wit a binary collation:

DECLARE @d         datetime2(3) = sysdatetime(),
        @searchstr varchar(50)  = '%abc%',
        @cnt       int
SELECT @cnt = COUNT(*) FROM guids 
       WHERE Windows_nvarchar COLLATE Latin1_General_BIN2 LIKE @searchstr
PRINT concat('Binary collation: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

This is the output:

Binary collation: 22 ms. Count: 0

While fast, you may note that there is a serious problem: the result is wrong. However, there is a simple remedy for this. Just wrap both the column and the search parameter in upper:

DECLARE @d         datetime2(3) = sysdatetime(),
        @searchstr varchar(50)  = '%abc%',
        @cnt       int
SELECT @cnt = COUNT(*) FROM guids 
       WHERE upper(Windows_nvarchar) COLLATE Latin1_General_BIN2 
             LIKE upper(@searchstr)
PRINT concat('Binary collation: ', datediff(ms, @d, sysdatetime()), ' ms. ', 
             'Count: ', @cnt) 

Now I get:

Binary collation: 31 ms. Count: 11643

That is, a sort of roll-your-own case-insensitivity. It will not always give the same result as a proper CI collation. For instance, a search string %RASS% will not match Straße. Nevertheless, with some luck you may get away with it.
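
To illustrate the Straße example, here is a small sketch; it relies on the fact that upper leaves ß as it is rather than expanding it to SS:

SELECT IIF(upper(N'Straße') COLLATE Latin1_General_BIN2 LIKE upper(N'%RASS%'),
           N'Match', N'No match') AS result
-- Returns No match.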

On the other hand, if users insist that the searches should be accent-insensitive as well, I think you will have no choice but to raise the white flag and stay with a regular Windows collation. But, who knows, maybe you can come up with something clever?

Note: Many years ago I wrote about how this type of search can be made even faster by building your own index. This text is not publicly available, but follow the link to get information about the book. Also, if users are searching for words rather than random letter combinations, there is all reason to look at full-text indexing.

Performance Tests – Introduction and Setup

We have now arrived at the less essential part of this chapter, that is data from my performance tests of all collations. This section has information about the test setup, and in the following sections I present various observations. These sections have a lot of numbers as well as queries that show how I arrive at those numbers. If you prefer, you can jump to the Executive Summary at the end of this chapter where I summarise my findings.

When I first developed the session for this topic in 2020, I wanted to have some performance data about different collations, so I could say something about the matter in my presentation. Therefore, I ran a test where I tested a number of operations on all 5508 official collations in SQL 2019. It kept my laptop busy for nine days. At home, I work from a desktop, and I only use the laptop when travelling or similar, so the laptop was idle apart from the test during these days. The SQL Server version was SQL 2019 CU8.

There is an apparent risk with such tests: you end up with Too Much Data™. Even worse, you may not be able to figure out what that data tells you. In this section, I will try to draw some conclusions from the data, but, as you will find, they are not always that clearcut. If you are in a situation where you need to make a decision on collations from the point of view of performance, you can possibly use my findings as an initial indication. But you will need to conduct your own tests based on your workload and data. As we shall see later, there are some flaws in my test setup. There are probably more that I have yet to discover. And, no, you would not test all 5540 collations, only those that would be an option for you.

I would like to remind you that most workloads include many operations that do not depend on the collation. There is a considerable chance that the impact of the choice of collation is completely dwarfed by everything else. Then again, there can certainly be critical operations where the collation matters beyond the two biggies we looked at in the two previous sections.

In the upcoming sections, I will present data where I look at the impact of various collation-related properties, such as case and accent sensitivity, the collation version, and the choice between varchar and nvarchar.

In this section, I describe the test setup. It starts with a brief description of a database with the full test results that I'm making available with this article. I certainly do not expect everyone to download that database, but if you are curious about specific collations, this is where you need to go, as I keep the discussion on an overall level. You find a backup of the database in the file collationbase.bak. It includes the test data, the results, and some tables that classify collation families and collations. The compressed backup is 75 MB, and the database itself is just under 300 MB in size.

Note: The tests in 2020 were not my first in this area. I ran similar tests in 2008 on an idle development server at work running SQL 2008 RC0. These tests, too, ran for over a week. Here I had less control over activity, but since SQL 2008 had not been released, we did not use the server for development. The results from 2008 are included in the database backup, but the discussion in this section is based only on the results from SQL 2019. That is more than enough.

There are twelve tables in the database.

words, worddata, randomtext
These are tables with data used in the tests.
collations
This table drove the test and controlled in which order the collations were tested. (It was filled in random order.)
teststmts
This table holds the actual test statements that were used in the tests for four of the five test operations. I discuss them more below.
families
This table holds all the collation families, with the number of collations per family. There is also a flag indicating whether the family relates to a "European" language. (As I will come back to later, it's quite an odd definition of "European".)
wincollations
This table has one row for each Windows collation with columns for the various properties of the collations. This table resurfaces in the queries that produce the data I present.
sqlcollations
This table has one row for each SQL collation with columns for their properties.
results
These are the raw results from the tests on SQL 2019. All tests were executed five times, not consecutively for the same collation, but spread out over those nine days. This is reflected in the column testno. There are two columns for the execution times in this table. The column tookms has the execution time in milliseconds and is used for most tests. The column tookµs has the execution time in microseconds and is used for the LOOKUP and MISFIT tests (see below).
results2008
The raw results from the tests on SQL 2008. As I recall, I think I only tested each operation once for each collation.
ratios and ratios2008
These tables are a derivation of the results and hold ratios against the collation that fared best in the given test, and there are some other ratios as well. The tables cover Windows collations only. This was an attempt to analyse the data in 2020, but when writing this text, I opted to ignore these tables.

For full disclosure, I have also made available three scripts that I used for the tests. Be warned that the scripts are largely uncommented, and I will not use space trying to explain them.

collationbase.sql
I used this script to build the test database. You will not be able to run it as-is, as I refer to files and other databases and servers to load the test data. (For practical reasons, I am not making these sources available). The second half of the script also has a number of analysing queries that I ran in 2020, but which I did not revisit for this article.
Testrunner.sql
This is the script that ran the tests for all collations with a code page, that is, the major bunch. It was when working with this script I learnt about the joys of scripts that span multiple databases with different collations. :-)
Testrunnernovarchar.sql
This is a variation of the Testrunner script for collations that do not support varchar. By having a separate script, the main script could be kept simpler.

I had five test operations, as detailed below. I tested all operations with both varchar and nvarchar data. In addition, there is a variation of one test with a mix of data types. To simplify the presentation, I'm using generic table names in the queries below. The actual test queries use different tables, with one table for varchar data and one for nvarchar.

HASHGRP
This was a GROUP BY query that looked like this:
DECLARE @dummy int     
SELECT @dummy = COUNT(*)
FROM   (SELECT word
        FROM   testtbl
        GROUP  BY word
        HAVING SUM(data) > 5000) AS x
OPTION (HASH GROUP)
The reason for the variable is that I did not want to produce a result set as the consumption of the result set could distort the tests. I force a hash match for the GROUP BY, exactly because I wanted to get hashing into the test. (In the results table, this test goes by the label GROUPBY.)
LIKE
A LIKE operation with leading and trailing wildcards:
DECLARE @dummy int 
SELECT @dummy = COUNT(*) 
FROM   testtbl
WHERE  randomtext LIKE '%abc%'
LOOKUP
This was a test of a point lookup. Just like in the section The Data-Type Mismatch Accident, this test ran a loop for some time and counted how many lookups were performed in that time. The body of the loop had 10 queries in this style:
SELECT @dummy = data FROM testtbl WHERE word = @word1
There was an index on the word column. The data recorded in the tookµs column is the computed average execution time for a single lookup.
LOOPJOIN
A test with a join, where I force nested loops:
DECLARE @dummy int
SELECT @dummy = SUM(k.data)
FROM   (SELECT word FROM testtbl WHERE id BETWEEN 150000 and 250000) d
JOIN   testkeys k ON d.word = k.word
OPTION (LOOP JOIN)
MISFIT
This test is a variation of the LOOKUP test where the table column is varchar, but the data to lookup is nvarchar. That is, the same mismatch that we looked at in the section The Data-Type Mismatch Accident.
ORDERBY
A test with a sorting operation:
DECLARE @dummy nvarchar(50)
; WITH CTE AS (
    SELECT word, row_number() OVER(ORDER BY word) AS rowno
    FROM   testtbl
)
SELECT @dummy = word
FROM   CTE 
WHERE  rowno = 67212
It seemed easier to use an ORDER BY in an OVER clause than a classic ORDER BY at the end to avoid having to take care of a full result set, which is likely to distort the measurements.

Just a quick note on what impact the collation may have on these operations. To repeat, the HASHGRP test is a hashing operation. The LIKE operation requires an equality test of many characters in the search string, as I discussed in the previous section. In the LOOKUP test, SQL Server only has to look at as many characters as there are in the search string (and fewer if there is no match), but it also has to traverse an index. LOOPJOIN is in reality a number of lookup operations on equality that require index traversal. In the ORDERBY test, the table has to be scanned, and then the data has to be sorted, which means that the ordering rules of the collation come into play.

Some information about the test data: For the LIKE test, the test data was ASCII only, just a number of concatenated GUIDs. For the other tests, I used words from a Slovenian spelling dictionary that I found on the net many years ago. To this set of data, I added a smaller number of place names. Thus, the data primarily includes characters supported by code page 1250. As we shall see later, this skews some of the tests.

Is Case-Sensitive Faster?

The first thing we will do is to test the hypothesis that case-sensitive collations are faster than case-insensitive, since they have to consider fewer alternative character matches. This first investigation also serves to find a methodology to make the analysis.

First of all, I decided to compute all results per test and per data type. Trying to aggregate data across tests would be problematic, since the different operations as such take a very different amount of time. I also decided to not aggregate out the data type, since in some cases there can be major differences. Thus, in the display of the various results there are typically eleven rows. (Two for each test plus MISFIT.)

Next, I had run each test five times. How to use these five test runs? At first, I simply took all values and computed averages for the entire test, as well as on family level (more about that later). But in some comparisons, I found strange results that I could conclude came from outliers. That is, values that deviate considerably from the other values for the same test, collation and data type. A common way to deal with outliers in scientific investigations is to use the median. However, the median is a little cumbersome to use in SQL Server, as there is no plain aggregate function for this purpose.
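
For reference, here is a sketch of how a median could be computed over the results table described above. Since percentile_cont is a window function rather than a regular aggregate, you have to remove the duplicated rows yourself, which is the cumbersomeness I am referring to:

SELECT DISTINCT label, isunicode, collation,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY coalesce(tookms, tookµs))
          OVER (PARTITION BY label, isunicode, collation) AS median_time
FROM   results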

When I thought more about it, I decided to take the smallest value of the five test runs. After all, I had been running these tests on a laptop that was supposed to be idle. Say that the normal execution time for an operation is 1500 ms. How likely is it that all of a sudden there would be an execution below 1000 ms? On the other hand, despite that I had tried to disable as much as possible, the laptop may have decided to do some internal work which caused an occasional execution to take 3000 ms. A brief analysis of the ratio between the two smallest values seemed to confirm this. There were six combinations for which this ratio exceeded 1.2. (In contrast, the ratio between the highest and the next-highest value exceeded this limit for more than 8000 combinations. For around 1200 combinations, this ratio was 2.0 or more.)

Based on this, I decided to work with this view (which is not included in the backup):

CREATE VIEW resultsview AS
   SELECT test = IIF(label = 'GROUPBY', 'HASHGRP', label),
          datatype = IIF(isunicode = 1, 'nvarchar', 'varchar'), 
          collation, convert(float, MIN(coalesce(tookms, tookµs))) AS optime
   FROM   results
   GROUP  BY label, isunicode, collation

The view takes the opportunity to change some names in the results table to fit better with my report queries. I also consolidate the two time columns into a single one, optime. Since I'm reporting data by test, there is no issue with having milliseconds and microseconds in the same column. The convert to float is there to give the best possible precision for the average calculations later on.

Next is the question of how to perform the analysis itself. After playing around with various approaches, I decided that the best is to compare two collation properties, in this case CI and CS, against each other by computing the ratio between the values of optime. This led to this query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.case_
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.case_ IS NOT NULL
)
SELECT test, datatype,
       CI = str(AVG(CASE WHEN case_ = 'CI' THEN optime END), 7, 2),
       CS = str(AVG(CASE WHEN case_ = 'CS' THEN optime END), 7, 2),
       CS_CI_ratio = str(
               AVG(CASE WHEN case_ = 'CS' THEN optime END) /
               AVG(CASE WHEN case_ = 'CI' THEN optime END), 7, 2)
FROM   selection
GROUP  BY test, datatype
ORDER  BY test, datatype DESC

In the CTE, I join the results with the wincollations table, where I have columns for the various collation properties, and the one of interest here is the case_ column, which can be CI, CS or NULL. The latter is for the binary collations, whence the WHERE clause. A consequence of the join to wincollations is that the 77 SQL collations are excluded from the comparison.

In the main query, I use a pivoting technique to get the average execution time across all collations for the two styles being evaluated. I then compute the ratio between the averages for CS and CI collations. A value < 1 means that CS is faster. A value > 1 means that CI is faster. The str function is there only to format the output.

Here is the output of the query:

test      datatype    CI        CS        CS_CI_ratio
--------  ---------  --------  --------  -----------
HASHGRP   varchar     951.83    952.04      1.00
HASHGRP   nvarchar    836.71    839.80      1.00
LIKE      varchar    3084.16   3282.81      1.06
LIKE      nvarchar   3059.68   3264.52      1.07
LOOKUP    varchar       8.13      8.11      1.00
LOOKUP    nvarchar      7.11      7.06      0.99
LOOPJOIN  varchar     763.29    768.17      1.01
LOOPJOIN  nvarchar    581.15    587.42      1.01
MISFIT    varchar      26.55     26.54      1.00
ORDERBY   varchar    5547.37   5571.58      1.00
ORDERBY   nvarchar   3683.37   3715.20      1.01

First, I would like to point out that the purpose of the CI and CS columns is to show the magnitude of the values being compared. The column of main interest is CS_CI_ratio. We can quite directly see that the hypothesis proved to be false. There is not any significant difference at all, except for the LIKE test, where case-sensitive is six to seven per cent slower than case-insensitive. Possibly, this can be attributed to the fact that algorithms and character tables are optimised primarily for case-insensitive collations, since they are more commonly used.

Now, there is a possible problem with this result. This is computed across all collations. Could there be variations between different pairs of collations so that if we only looked at Czech_CI_AS and Czech_CS_AS, we would see quite different percentages than if we compared Mohawk_100_CI_AS_KS and Mohawk_100_CS_AS_KS? This is important, because if you only work with Czech data, you don't care what applies to Mohawk and vice versa.

To investigate this, I wrote a query that computes the percentages per collation family and then determines the min and max values for these percentages. A better choice might have been to compare pairs of collations of which the only difference is the CI/CS setting, but that would have required some acrobatics. On the other hand, since the wincollations table has a family column this was very easy to implement. I figured that it would be good enough to reveal whether there are considerable ranges between the min and max values. This is the query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.case_
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.case_ IS NOT NULL
), perfamily AS (
   SELECT test, datatype, family,
          CS_CI_ratio = AVG(CASE WHEN case_ = 'CS' THEN optime END) /
                        AVG(CASE WHEN case_ = 'CI' THEN optime END)
   FROM   selection
   GROUP  BY test, datatype, family
)
SELECT test, datatype, 
       minvalue = str(MIN(CS_CI_ratio), 7, 2), 
       maxvalue = str(MAX(CS_CI_ratio), 7, 2)
FROM   perfamily
GROUP  BY test, datatype
ORDER  BY test, datatype DESC

Here is the output:

test      datatype   minvalue  maxvalue
--------  ---------  --------  --------
HASHGRP   varchar      0.98      1.02
HASHGRP   nvarchar     0.99      1.09
LIKE      varchar      1.04      1.09
LIKE      nvarchar     1.05      1.09
LOOKUP    varchar      0.98      1.02
LOOKUP    nvarchar     0.98      1.03
LOOPJOIN  varchar      0.99      1.02
LOOPJOIN  nvarchar     1.00      1.07
MISFIT    varchar      0.98      1.01
ORDERBY   varchar      1.00      1.02
ORDERBY   nvarchar     1.00      1.25

For most tests, the ranges are not remarkably big, but there is one exception: the range for ORDERBY and nvarchar. Out of curiosity, I ran a variation of the query above to determine the collation families with the biggest ratios. It turned out to be very interesting. There is one single collation family that stands out with this ratio of 1.25 and the second biggest ratio for a family is 1.03. And this family is not just "any" family, no, it is Latin1_General, the most popular of them all! (While SQL collations are excluded from this comparison, recall that for nvarchar SQL_Latin1_General_CP1_CI_AS is the same as Latin1_General_CI_AS.)

You may ask "why is case-sensitive so much slower for Latin1_General?" This is something we will come back to later, but as a teaser: It is not that something is extra slow here.

Let's go back to the ranges as such. Apart from this noticeable exception, the ranges are fairly moderate, and you could argue they're not worth making noise about. However, for some of the comparisons that come later, there are ranges that are a lot wider.

While we are at it, let's also look at the results for accent-sensitivity. Here is a query where I have combined the two previous queries into one, and this is the pattern I will use for the rest of this chapter:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.accent
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.accent IS NOT NULL
), dataallover AS (
   SELECT test, datatype,
          "AS" = AVG(CASE WHEN accent = 'AS' THEN optime END),
          "AI" = AVG(CASE WHEN accent = 'AI' THEN optime END),
           AS_AI_ratio = AVG(CASE WHEN accent = 'AS' THEN optime END) /
                         AVG(CASE WHEN accent = 'AI' THEN optime END)
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          AS_AI_ratio = AVG(CASE WHEN accent = 'AS' THEN optime END) /
                        AVG(CASE WHEN accent = 'AI' THEN optime END)
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN(AS_AI_ratio), maxvalue = MAX(AS_AI_ratio)
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       "AI"        = str(a."AI", 10, 2),
       "AS"        = str(a."AS", 10, 2),
       AS_AI_ratio = str(a.AS_AI_ratio, 7, 2),
       minvalue    = str(m.minvalue, 7, 2),
       maxvalue    = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test     = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.test, a.datatype DESC
This is the output:

test      datatype    AI        AS        AS_AI_ratio  minvalue  maxvalue
--------  ---------  --------  --------  -----------  --------  --------
HASHGRP   varchar     951.76    952.10      1.00        0.98      1.02
HASHGRP   nvarchar    837.74    838.76      1.00        0.99      1.01
LIKE      varchar    3185.95   3181.02      1.00        0.98      1.03
LIKE      nvarchar   3161.88   3162.32      1.00        0.97      1.03
LOOKUP    varchar       8.07      8.18      1.01        0.99      1.03
LOOKUP    nvarchar      7.04      7.13      1.01        0.98      1.03
LOOPJOIN  varchar     761.87    769.59      1.01        0.99      1.03
LOOPJOIN  nvarchar    579.31    589.27      1.02        1.00      1.04
MISFIT    varchar      26.48     26.61      1.00        0.99      1.02
ORDERBY   varchar    5486.34   5632.61      1.03        1.01      1.08
ORDERBY   nvarchar   3626.44   3772.12      1.04        1.02      1.13

These results are nowhere close to startling. The only test with any sort of significance is the ORDERBY test, which shows that accent-insensitive is just a wee bit faster. You may wonder if that max ratio for ORDERBY is a gross outlier just like Latin1_General for case-sensitivity. The answer is nope. There are eight collation families in the range 1.10 to 1.13. Interestingly enough, they are all families for Chinese, Japanese and Korean that only have version-80 and version-90 collations.

This could be related to a minor flaw in my wincollations table. The family column is based on the collation name, and therefore families that were renamed in SQL 2008 while retaining their LCID appear as two different families. This is because I realised the significance of the LCID only after I had written the first draft of this chapter. I did not find it worthwhile to correct the flaw, rerun the queries and update all the tables.
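
For reference, this is how you can retrieve the LCID for a collation, in case you want to group collations by LCID rather than by the name-based family:

SELECT name, collationproperty(name, 'LCID') AS lcid
FROM   sys.fn_helpcollations()
ORDER  BY lcid, name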

The Impact of the Collation Version

The next thing we will look at is how the version number of the collation may affect performance. Here is a query along the pattern above:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.version
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NULL
     AND  w.surrogate = ''
     AND  w.utf8 = ''
), dataallover AS (
   SELECT test, datatype,
          "80/90" = AVG(CASE WHEN version IN ('80', '90') THEN optime END),
          "100"   = AVG(CASE WHEN version = '100' THEN optime END),
          "100_80/90_ratio" = AVG(CASE WHEN version = '100' THEN optime END) /
                              AVG(CASE WHEN version IN ('80', '90') THEN optime END)
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          "100_80/90_ratio" = AVG(CASE WHEN version = '100' THEN optime END) /
                              AVG(CASE WHEN version IN ('80', '90') THEN optime END)
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN("100_80/90_ratio"), maxvalue = MAX("100_80/90_ratio")
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       "80/90"           = str(a."80/90", 10, 2),
       "100"             = str(a."100",   10, 2),
       "100_80/90_ratio" = str(a."100_80/90_ratio", 7, 2),
       minvalue          = str(m.minvalue, 7, 2),
       maxvalue          = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test    = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.test, a.datatype DESC

First of all, I'm ignoring the version-140 collations since they are only for Japanese. If you are working with Japanese and you want to know what the penalty might be for using these collations, I kindly refer you to the downloadable backup, see above. Or even better, run your own tests with Japanese data. As for the version-160 collations, they had not been released when I ran the tests.

Also, you may notice that I have lumped the version-80 and version-90 collations together. In an initial investigation, I did not find any major difference between them. More importantly, though, there are not very many collation families that have both a version-80 and version-90 collation, since the version-90 collations mainly were meant to provide support for more languages. On the other hand, version-100 brought new collations to almost all collation families existing both in version-80 and version-90. (Although, for reasons discussed above, this comparison fails to include the families that were renamed in SQL 2008.)

Note that I'm filtering out surrogate-compatible collations including UTF‑8 collations. We will come back to those later.

Here are the results.

test      datatype    80/90     100      100_80/90_ratio  minvalue  maxvalue
--------  ---------  --------  --------  ---------------  --------  --------
HASHGRP   varchar     882.86    969.62        1.10          1.03      1.23
HASHGRP   nvarchar    775.80    858.44        1.11          1.04      1.27
LIKE      varchar    2903.28   3258.75        1.12          1.05      1.23
LIKE      nvarchar   2889.83   3230.93        1.12          1.04      1.22
LOOKUP    varchar       7.76      8.04        1.04          0.99      1.15
LOOKUP    nvarchar      6.82      7.18        1.05          1.01      1.17
LOOPJOIN  varchar     716.80    765.86        1.07          0.98      1.25
LOOPJOIN  nvarchar    540.38    599.12        1.11          1.01      1.34
MISFIT    varchar      24.56     26.81        1.09          1.00      1.24
ORDERBY   varchar    4638.24   5405.15        1.17          1.00      1.58
ORDERBY   nvarchar   3037.35   3919.55        1.29          1.06      1.92

Certainly more distinct ratios than for CI/CS and AI/AS. Still, for all operations but ORDERBY, the overall average suggests only a moderate penalty for using a version-100 collation and thereby getting support for a lot more languages.

We can also note that the ranges of the ratios are wider here, which indicates that the actual penalty depends on the collation family. That is, if you are in doubt whether to use a version-80/90 collation or a version-100 collation for your database, you may want to check out the data for your collation family in the collationresults database, as described above.

For the ORDERBY the difference is more pronounced, but the range is very wide, from a mere 6 % cost to almost a doubling in execution time. I decided to take a closer look to see if it is just one lone family far away from anything else. Here are the top and bottom five families:

Hungarian_Technical 1.92
Vietnamese 1.76
Latvian 1.74
Slovak 1.69
Albanian 1.67
...  
Uzbek_Latin 1.08
Kazakh 1.08
Syriac 1.08
Divehi 1.08
Latin1_General 1.06

When it comes to the top 5, Hungarian_Technical is not a radical outlier, but its ratio is distinctly higher than the next four, for which the ratios are closer to each other. We will briefly come back to the case of Hungarian_Technical later. At the bottom, it is very even. What is a little remarkable is that four of the five families in the group with the lowest ratios are families that do not have a version-80 collation, but only a version-90 collation. Despite what I said about the difference between version-80 and version-90 not being that big.
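
If you have restored the database from collationbase.bak, a sketch like this lets you verify which versions exist for those bottom-five families. I'm assuming here that the family values in wincollations match the names shown above:

SELECT family,
       MAX(CASE WHEN version = '80' THEN 1 ELSE 0 END) AS has_80,
       MAX(CASE WHEN version = '90' THEN 1 ELSE 0 END) AS has_90
FROM   wincollations
WHERE  family IN ('Uzbek_Latin', 'Kazakh', 'Syriac', 'Divehi', 'Latin1_General')
GROUP  BY family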

You may ask why the newer collations are slower. I guess the answer is that since they have more defined characters, the character tables are larger and therefore are more complex to traverse.

The Difference Between varchar and nvarchar

Let's now look at what impact the data type has. The query for this purpose has some differences compared to the previous ones. Since we are comparing the data types, we are only aggregating on the test itself. When it comes to the min and max values, this particular comparison permits us to use the collation rather than the family to capture the absolute extreme values. Here is the query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.collation
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NULL
     AND  w.utf8 = ''
     AND  r.test <> 'MISFIT'
), dataallover AS (
   SELECT test,
          varchar    = AVG(CASE WHEN datatype = 'varchar'  THEN optime END),
          nvarchar   = AVG(CASE WHEN datatype = 'nvarchar' THEN optime END),
          nv_v_ratio = AVG(CASE WHEN datatype = 'nvarchar' THEN optime END) /
                       AVG(CASE WHEN datatype = 'varchar'  THEN optime END)
   FROM   selection
   GROUP  BY test
   HAVING AVG(CASE WHEN datatype = 'varchar'  THEN optime END) IS NOT NULL
), percollation AS (
   SELECT test, collation,
          nv_v_ratio = AVG(CASE WHEN datatype = 'nvarchar' THEN optime END) /
                       AVG(CASE WHEN datatype = 'varchar'  THEN optime END)
   FROM   selection
   GROUP  BY test, collation
   HAVING AVG(CASE WHEN datatype = 'varchar'  THEN optime END) IS NOT NULL
), minmaxvalues AS (
   SELECT test, 
          minvalue = MIN(nv_v_ratio), maxvalue = MAX(nv_v_ratio)
   FROM   percollation
   GROUP  BY test
)
SELECT a.test, 
       varchar              = str(a.varchar, 10, 2),
       nvarchar             = str(a.nvarchar,   10, 2),
       "nvarchar / varchar" = str(a.nv_v_ratio, 7, 2),
       minvalue             = str(m.minvalue, 7, 2),
       maxvalue             = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test = a.test
ORDER  BY a.test

As for the filter in the first CTE, we will look at binary collations separately in just an instant. We will also cover SQL collations later. For the same reason, I am filtering out UTF‑8 collations. The MISFIT test is filtered out since it's only executed for varchar. As for the HAVING clauses in the later CTEs, they serve to filter out the collations that have no code page and therefore do not support varchar.

Here are the results:

test       varchar   nvarchar   nvarchar / varchar   minvalue  maxvalue
HASHGRP     943.51     834.50          0.88            0.82      1.02
LIKE       3166.57    3151.13          1.00            0.82      1.22
LOOKUP        7.97       7.07          0.89            0.79      0.97
LOOPJOIN    750.69     581.73          0.77            0.66      0.87
ORDERBY    5174.44    3664.59          0.71            0.55      0.84

As you can see, varchar is slower on the overall average, although there is some span between the min and max values. The reason that varchar is slower is very likely that for Windows collations all operations are carried out in UTF‑16, and there is a cost for the conversion. You may note that the LIKE test is an exception where the average is exactly 1.00. My assumption is that since the conversion only occurs once per row, the cost of the conversion is dwarfed by all the character comparisons required for LIKE.

Before we conclude that nvarchar is a better choice for performance for Windows collations, we need to consider the fact that in many (but not all!) collations varchar data is only eight bits and takes up half the space of nvarchar. Now, in my test setup the tables are narrow, with the string column being the widest column. So you could say that my tests do take the size difference into account to some extent. However, not all operations are string operations. Say that the query is:

SELECT * FROM tbl WHERE nonindexed_int_column = 99

and that the table has many string columns. There is all reason to expect the query to run faster if these string columns are varchar than if they are nvarchar, since the table will be smaller and therefore faster to scan. Again, assuming a single-byte code page like for English or Swedish.
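As a quick illustration of the size difference itself, here is a minimal sketch (the string is just an arbitrary example); in a single-byte code page, the same ASCII string takes twice as many bytes as nvarchar as it does as varchar:

SELECT datalength('collation')  AS varchar_bytes,   -- 9 bytes
       datalength(N'collation') AS nvarchar_bytes   -- 18 bytes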

Let's also look at the same comparison for binary collations (just flip the condition in the query above), because there are some changes:

test       varchar   nvarchar   nvarchar / varchar   minvalue  maxvalue
HASHGRP     724.08     653.11          0.90            0.84      0.98
LIKE        252.50     169.06          0.67            0.38      0.90
LOOKUP        6.30       6.26          0.99            0.94      1.05
LOOPJOIN    481.65     445.56          0.93            0.85      1.02
ORDERBY    1610.73    1822.90          1.13            1.04      1.24

The result here may at first seem a little bewildering. Now the LIKE operations have the biggest gain for nvarchar over varchar, quite the opposite to the result for the non-binary collations. But there is a quite simple explanation: For the LIKE operation, the character comparison is a lot faster since it is by code point and without need for large character tables. On the other hand, the cost for converting from the code page of varchar to UTF‑16 is the same, so its relative part of the execution time is now a lot more significant.

As you may note, the range for the LIKE operation is unusually wide here. When I dug into the details, I found that there is not one range here, but two. And the average value of 0.67 is not particularly meaningful. One range goes from 0.38 to 0.42. All collations in this range are Korean, Japanese and Chinese collations. That is, collations where varchar has a multi-byte code page. The other range goes from 0.74 to 0.90, and all collations here are those with single-byte code pages. Apparently, the conversion to UTF‑16 is more expensive for the multi-byte character sets, even when, as in this case, the data is ASCII only.

Another noticeable detail is that for the ORDERBY test, varchar comes out as a winner for the binary collations. My guess is that for binary collations, they don't cast to UTF-16, but sort by the code point according to the code page, so varchar wins thanks to the smaller table size.

So far, we have ignored the SQL collations. However, for this comparison of data types, there is all reason to look at them. We have already seen one case where the performance difference is drastic.

I don't show the full query for SQL collations, but only show the first CTE; the rest of the query is the same as above.

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, s.collation
   FROM   resultsview r
   JOIN   sqlcollations s ON s.collation = r.collation
   WHERE  s.style LIKE 'C%'
     AND  r.test <> 'MISFIT'

The filter on the style column serves to filter out binary collations.

Here are the results:

test       varchar   nvarchar   nvarchar / varchar   minvalue  maxvalue
HASHGRP     733.48     773.14          1.05            0.97      1.16
LIKE        323.34    2853.51          8.83            6.54     14.40
LOOKUP        6.53       6.79          1.04            0.97      1.13
LOOPJOIN    497.67     523.97          1.05            0.94      1.20
ORDERBY    1958.88    2993.32          1.53            1.20      2.04

As we learnt earlier, the ratio for the LIKE operation is drastic. Almost a factor of nine on average, and up to a factor > 14 in the most extreme case. You may recall that in the demo in the section LIKE '%abc%' the ratio was just above six. The greater ratio in this case is due to the column in this test table being about ten times longer.

For the ORDERBY test, there is also a significant gain from using varchar over nvarchar: nvarchar is a little more than 50 % slower on average. In the most extreme case, the execution time for varchar is half the time for nvarchar. On the other hand, for the HASHGRP, LOOKUP and LOOPJOIN operations the gain is on average a mere 4-5 %, but as you can see there is some variation for the individual collations.

For completeness, here are the results for the few binary SQL collations:

test       varchar   nvarchar   nvarchar / varchar   minvalue  maxvalue
HASHGRP     725.50     652.50          0.90            0.89      0.93
LIKE        213.50     172.75          0.81            0.74      0.86
LOOKUP        6.36       6.28          0.99            0.95      1.01
LOOPJOIN    488.50     439.00          0.90            0.83      0.99
ORDERBY    1615.25    1822.00          1.13            1.05      1.21

Oops, nvarchar seems to be somewhat faster here. Except, then, for sorting.

Binary Collations

Let's now compare binary collations to regular collations. We can expect them to be faster, but how much faster? I used this query, along the pattern I have established:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.binary
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
), dataallover AS (
   SELECT test, datatype,
          binary  = AVG(CASE WHEN binary IS NOT NULL  THEN optime END),
          regular = AVG(CASE WHEN binary IS NULL      THEN optime END),
          binary_vs_regular = 
                   AVG(CASE WHEN binary IS NULL     THEN optime END) /
                   AVG(CASE WHEN binary IS NOT NULL THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          binary_vs_regular = 
                   AVG(CASE WHEN binary IS NULL     THEN optime END) /
                   AVG(CASE WHEN binary IS NOT NULL THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN(binary_vs_regular), maxvalue = MAX(binary_vs_regular)
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       binary            = str(a.binary, 10, 2),
       regular           = str(a.regular, 10, 2),
       binary_vs_regular = str(a.binary_vs_regular, 7, 2),
       minvalue          = str(m.minvalue, 7, 2),
       maxvalue          = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test     = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.test, a.datatype DESC

This time I am not filtering out UTF‑8 collations.

The results:

test      datatype    binary   regular   binary_vs_regular  minvalue  maxvalue
HASHGRP   varchar     723.92    951.93         1.31           1.17      1.53
HASHGRP   nvarchar    653.14    838.25         1.28           1.16      1.50
LIKE      varchar     253.16   3183.48        12.58           6.99     17.47
LIKE      nvarchar    169.06   3162.10        18.70          16.75     21.66
LOOKUP    varchar       6.30      8.12         1.29           1.20      1.42
LOOKUP    nvarchar      6.26      7.08         1.13           1.07      1.26
LOOPJOIN  varchar     481.52    765.73         1.59           1.44      1.88
LOOPJOIN  nvarchar    445.59    584.29         1.31           1.17      1.63
MISFIT    varchar       9.57     26.55         2.78           2.36      3.39
ORDERBY   varchar    1610.77   5559.47         3.45           2.77      4.73
ORDERBY   nvarchar   1822.51   3699.28         2.03           1.60      3.11

For the LIKE test, the gains are tremendous. They are also remarkable for the sorting operation, with a reduction of more than a factor of 3 on average for varchar, while for nvarchar, the binary collation "only" cuts the execution time in half. On average, that is. Depending on the collation family, the gain can be more than a factor of three or a mere 60 %.

For the HASHGRP, LOOKUP and LOOPJOIN tests, the gains are not equally spectacular, but they are still significant. Finally, the MISFIT test benefits unexpectedly well from the binary collation, which is more of a curiosity than anything else. It's still a coding mistake.

While these gains are likely to be less noticeable in a full workload that includes a lot more than string operations, it cannot be denied that there is a certain allure to these numbers. At the same time, there is the obvious problem: users are not likely to accept the results from a system that uses a binary collation throughout. Nevertheless, we will come back to this idea in the next chapter about design considerations.

And now for the comparison you have been waiting for since you learnt about the BIN and BIN2 collations. How much do the BIN collations gain by only swapping the first byte? Here is the query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.binary
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NOT NULL
), dataallover AS (
   SELECT test, datatype,
          BIN   = AVG(CASE WHEN binary = 'BIN'  THEN optime END),
          BIN2  = AVG(CASE WHEN binary = 'BIN2' THEN optime END),
          BIN2_vs_BIN = AVG(CASE WHEN binary = 'BIN2'  THEN optime END) /
                        AVG(CASE WHEN binary = 'BIN' THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          BIN2_vs_BIN = AVG(CASE WHEN binary = 'BIN2'  THEN optime END) /
                        AVG(CASE WHEN binary = 'BIN' THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN(BIN2_vs_BIN), maxvalue = MAX(BIN2_vs_BIN)
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       BIN         = str(a.BIN, 10, 2),
       BIN2        = str(a.BIN2, 10, 2),
       BIN2_vs_BIN = str(a.BIN2_vs_BIN, 7, 2),
       minvalue    = str(m.minvalue, 7, 2),
       maxvalue    = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test     = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.test, a.datatype DESC

And here are the results:

test      datatype      BIN     BIN2   BIN2_vs_BIN  minvalue  maxvalue
HASHGRP   varchar    723.93   723.91       1.00       0.91      1.10
HASHGRP   nvarchar   653.02   653.26       1.00       0.94      1.05
LIKE      varchar    252.55   253.76       1.00       0.94      1.27
LIKE      nvarchar   169.08   169.05       1.00       0.89      1.07
LOOKUP    varchar      6.30     6.31       1.00       0.97      1.04
LOOKUP    nvarchar     6.33     6.20       0.98       0.93      1.02
LOOPJOIN  varchar    481.03   482.01       1.00       0.92      1.09
LOOPJOIN  nvarchar   451.68   439.55       0.97       0.91      1.04
MISFIT    varchar      9.58     9.55       1.00       0.96      1.03
ORDERBY   varchar   1610.07  1611.46       1.00       0.99      1.02
ORDERBY   nvarchar  1938.79  1707.11       0.88       0.87      0.89

As you can see, the differences are hardly noticeable, except for the ORDERBY test with nvarchar, where BIN2 is a good deal faster. Exactly why this is so, I don't really know. Maybe they implemented a new and better library for BIN2 while they were at it.

When it comes to the ranges, they are fairly narrow, with one noticeable exception: LIKE with varchar, where BIN2 gives you 27 % more execution time in the extreme case. I decided to write a query to see if there was something interesting hiding here. There was. There is one family that contributes to this high value, and it is Latin1_General.
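A query along the lines of the perfamily CTE above does the job. Here is a sketch, looking only at LIKE and varchar:

; WITH selection AS (
   SELECT r.optime, w.family, w.binary
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NOT NULL
     AND  r.test = 'LIKE'
     AND  r.datatype = 'varchar'
)
SELECT family,
       BIN2_vs_BIN = AVG(CASE WHEN binary = 'BIN2' THEN optime END) /
                     AVG(CASE WHEN binary = 'BIN'  THEN optime END)
FROM   selection
GROUP  BY family
ORDER  BY BIN2_vs_BIN DESC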

This had me stumped for a while, until I realised that Latin1_General has one more binary collation than all the other families, to wit, Latin1_General_100_BIN2_UTF8. I added a filter to the query above to filter out that sole UTF‑8 collation and boom! The max value for LIKE with varchar dropped to 1.08. And the average for Latin1_General dropped from 1.27 to 0.97.

It gets even more pronounced if we compare the two Latin1_General_100_BIN2 collations, as in this query:

SELECT test, collation, optime 
FROM   resultsview 
WHERE  collation LIKE 'Latin1%100_BIN2%' 
  AND  datatype = 'varchar' 
ORDER BY test, collation

The output:

test      collation                       optime
HASHGRP   Latin1_General_100_BIN2            747
HASHGRP   Latin1_General_100_BIN2_UTF8       688
LIKE      Latin1_General_100_BIN2            211
LIKE      Latin1_General_100_BIN2_UTF8       406
LOOKUP    Latin1_General_100_BIN2           6.31
LOOKUP    Latin1_General_100_BIN2_UTF8      6.03
LOOPJOIN  Latin1_General_100_BIN2            501
LOOPJOIN  Latin1_General_100_BIN2_UTF8       452
MISFIT    Latin1_General_100_BIN2          10.03
MISFIT    Latin1_General_100_BIN2_UTF8       8.8
ORDERBY   Latin1_General_100_BIN2           1627
ORDERBY   Latin1_General_100_BIN2_UTF8      1619

It's quite remarkable: The UTF8 collation is slightly faster in all tests, except for LIKE, where the execution time for UTF8 is almost double the execution time for Latin1_General_100_BIN2. Partly, this ties into the observation we made earlier with LIKE and binary collations, where the cost for the conversion to UTF‑16 proved to be higher for the multi-byte code pages for Chinese, Japanese and Korean. UTF‑8 is indeed also a multi-byte code page, but it is a very different one, since the conversion does not require any character tables, but is a plain transformation. So, this difference is still a bit mysterious to me. Yet, this is something we will see more of in the next section.

Performance of UTF‑8 Collations

So far, I've mainly filtered out the UTF‑8 collations from the queries, because I wanted to look at them separately, and that is what we are going to do now.

In this section, I'm restricting the queries to collation families that I have labelled as "European", which is quite a misnomer. This flag is set for families for languages that use characters in the range U+0000 to U+07FF in Unicode, and thus take up at most two bytes in UTF‑8 encoding. Thus, this "European" flag is also set for Arabic, Persian and a few more collation families for languages outside Europe. On the other hand, it is not set for Georgian, since the Georgian script is beyond U+07FF in Unicode. I've also excluded Maltese, since this is one of these collation families without a legacy code page, so there is nothing to compare to.

Before we actually look at the UTF‑8 collations, we should observe that all UTF‑8 collations are surrogate compatible, and we have not looked at whether the support for supplementary characters comes with a cost, so let's first look at this. Here is a query to do this:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.surrogate
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON w.family = f.family
   WHERE  f.European = 1
     AND  w.version IN ('90', '100')
     AND  w.binary IS NULL
     AND  w.utf8 = ''
), dataallover AS (
   SELECT test, datatype,
          plain       = AVG(CASE WHEN surrogate = ''   THEN optime END),
          SC          = AVG(CASE WHEN surrogate = 'SC' THEN optime END),
          SC_vs_plain = AVG(CASE WHEN surrogate = ''   THEN optime END) /
                        AVG(CASE WHEN surrogate = 'SC' THEN optime END)
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          SC_vs_plain = AVG(CASE WHEN surrogate = ''   THEN optime END) /
                        AVG(CASE WHEN surrogate = 'SC' THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN(SC_vs_plain), maxvalue = MAX(SC_vs_plain)
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       plain       = str(a.plain, 10, 2),
       SC          = str(a.SC, 10, 2),
       SC_vs_plain = str(a.SC_vs_plain, 7, 2),
       minvalue    = str(m.minvalue, 7, 2),
       maxvalue    = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test     = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.test, a.datatype DESC

I filter on the version to only include version-90 and version-100, since these are the versions where we have this choice. And since the purpose is to look at the effect of SC-support as such, I'm filtering out the UTF‑8 collations. As noted above, I have a filter to include only "European" collation families.

Here are the results:

test      datatype     plain       SC   SC_vs_plain  minvalue  maxvalue
HASHGRP   varchar     970.21   970.63       1.00       0.99      1.01
HASHGRP   nvarchar    865.62   865.53       1.00       0.99      1.01
LIKE      varchar    3263.20  3285.42       0.99       0.97      1.01
LIKE      nvarchar   3229.49  3249.69       0.99       0.97      1.02
LOOKUP    varchar       7.99     7.99       1.00       0.99      1.00
LOOKUP    nvarchar      7.22     7.21       1.00       0.99      1.01
LOOPJOIN  varchar     768.38   769.13       1.00       0.99      1.01
LOOPJOIN  nvarchar    607.40   607.45       1.00       0.98      1.01
MISFIT    varchar      26.75    26.74       1.00       0.99      1.01
ORDERBY   varchar    5428.63  5428.60       1.00       1.00      1.00
ORDERBY   nvarchar   4003.94  4004.26       1.00       1.00      1.00

Can you say y-a-w-n? Well, now that we know that the support for supplementary characters has no effect on performance, we can take the surrogates out of the equation.

We will first compare the UTF‑8 collations to the SC collations with a legacy code page, so that UTF8 is the only difference. Here is a query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.utf8
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON f.family = w.family
   WHERE  f.European = 1
     AND  w.version IN ('90', '100')
     AND  w.binary IS NULL
     AND  w.surrogate = 'SC'
), dataallover AS (
   SELECT test, datatype,
          legacy_CP      = AVG(CASE WHEN utf8 = ''     THEN optime END),
          UTF8           = AVG(CASE WHEN utf8 = 'UTF8' THEN optime END),
          UTF8_vs_legacy = AVG(CASE WHEN utf8 = 'UTF8' THEN optime END) /
                           AVG(CASE WHEN utf8 = '' THEN optime END)
   FROM   selection
   GROUP  BY test, datatype
), perfamily AS (
   SELECT test, datatype, family,
          UTF8_vs_legacy = AVG(CASE WHEN utf8 = 'UTF8' THEN optime END) /
                           AVG(CASE WHEN utf8 = ''     THEN optime END) 
   FROM   selection
   GROUP  BY test, datatype, family
), minmaxvalues AS (
   SELECT test, datatype, 
          minvalue = MIN(UTF8_vs_legacy), maxvalue = MAX(UTF8_vs_legacy)
   FROM   perfamily
   GROUP  BY test, datatype
)
SELECT a.test, a.datatype, 
       legacy_CP      = str(a.legacy_CP, 10, 2),
       UTF8           = str(a.UTF8, 10, 2),
       UTF8_vs_legacy = str(a.UTF8_vs_legacy, 7, 2),
       minvalue       = str(m.minvalue, 7, 2),
       maxvalue       = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test     = a.test
                     AND m.datatype = a.datatype
ORDER  BY a.datatype, a.test

You may note that I've changed the order in the ORDER BY clause from the previous queries so that we get nvarchar and varchar in two different groups. Here are the results:

test      datatype  legacy_CP     UTF8   UTF8_vs_legacy  minvalue  maxvalue
HASHGRP   nvarchar     865.53   866.40        1.00         0.99      1.01
LIKE      nvarchar    3249.69  3242.79        1.00         0.98      1.04
LOOKUP    nvarchar       7.21     7.21        1.00         0.99      1.01
LOOPJOIN  nvarchar     607.45   608.60        1.00         0.99      1.02
ORDERBY   nvarchar    4004.26  4005.19        1.00         1.00      1.00
HASHGRP   varchar      970.63   988.37        1.02         0.99      1.04
LIKE      varchar     3285.42  3275.69        1.00         0.97      1.02
LOOKUP    varchar        7.99     8.52        1.07         1.00      1.10
LOOPJOIN  varchar      769.13   813.17        1.06         0.99      1.11
MISFIT    varchar       26.74    27.57        1.03         1.00      1.06
ORDERBY   varchar     5428.60  6567.92        1.21         1.05      1.32

First, for nvarchar, there are no discernible differences at all. And nor should there be, since the only difference between the two groups of collations is the code page for the varchar data type. For nvarchar, Something_SC and Something_SC_UTF8 are identical. (SQL Server does not consider a mix of them to be a collation conflict, as we learnt earlier).
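Here is a small sketch to illustrate that last point (the temp-table and column names are made up for the example):

CREATE TABLE #sc      (str nvarchar(20) COLLATE Latin1_General_100_CI_AS_SC      NOT NULL)
CREATE TABLE #sc_utf8 (str nvarchar(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL)

-- Despite the two different collation names, this join does not
-- result in a collation-conflict error.
SELECT a.str
FROM   #sc a
JOIN   #sc_utf8 b ON a.str = b.str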

On the other hand, for varchar, it seems that for most collation families the performance of the UTF‑8 collations is somewhat inferior to the collations with a legacy code page. For most of the tests, the cost is small, but for the ORDERBY test, we see the execution time increasing on average by more than 20 %, and in the worst case by more than 30 %.

The only difference here is the code page, so this is a little worrying. Why would the conversion from UTF‑8 to UTF‑16 take up to 32 % more time than the conversion from a legacy code page? As I discussed above, the conversion from UTF‑8 can be done without mapping tables, so I would naïvely expect it to be faster. I wrote a query to look at the ratios per family for the ORDERBY test. When I looked at the results, the penny dropped: there is a problem with my test setup, and it is not as bad as the results above suggest, as we will see in a second.

I will not show the result per family, since that takes up too much space. Instead, here is a more direct query that points to the sore part, to wit the legacy code page:

; WITH selection AS (
   SELECT r.optime, w.utf8, w.collation,
          codepage = collationproperty(f.family + '_100_CI_AS', 'codepage')
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON f.family = w.family
   WHERE  f.European = 1
     AND  w.version IN ('90', '100')
     AND  w.binary IS NULL
     AND  w.surrogate = 'SC'
     AND  r.test = 'ORDERBY'
     AND  r.datatype = 'varchar'
), ratios AS (
   SELECT codepage = collationproperty(leg.collation, 'codepage'),
          UTF8_vs_legacy = utf.optime / leg.optime
   FROM   selection leg
   JOIN   selection utf ON utf.collation = leg.collation + '_UTF8'
)
SELECT codepage, 
       UTF8_vs_legacy = str(AVG(UTF8_vs_legacy), 7, 2),
       minvalue       = str(MIN(UTF8_vs_legacy), 7, 2),
       maxvalue       = str(MAX(UTF8_vs_legacy), 7, 2)
FROM   ratios
GROUP  BY codepage
ORDER  BY UTF8_vs_legacy

I only look at the ORDERBY test (because this is where we see values that stand out), and only for varchar. For each pair of collations with the same properties except for the code page, I compute the ratio, with the execution time for the collation with the legacy code page as the base. Then I compute the average, min and max values of these ratios per legacy code page.

This is the output from the query (with name of the code pages added manually for clarity):

codepage                 UTF8_vs_legacy  minvalue  maxvalue
1257 (Baltic)                 1.06         1.04      1.08
1250 (Eastern Europe)         1.11         1.07      1.14
1254 (Turkish)                1.20         1.14      1.25
1256 (Arabic)                 1.25         1.21      1.28
1252 (Latin‑1)                1.25         1.20      1.31
1255 (Hebrew)                 1.29         1.27      1.30
1253 (Greek)                  1.31         1.29      1.33
1251 (Cyrillic)               1.31         1.24      1.35

We can see that the penalty for UTF‑8 appears to be a lot lower for the families where the legacy code page is 1257 and 1250. How could that be? The answer lies in my test data. The main bulk of the data comes from a Slovenian spelling dictionary, and Slovenian uses three letters not in ASCII: č, š and ž. When the test script runs, it loads the data from a source table to the test table like this:

CREATE TABLE mbcswords(id     int          NOT NULL,
                       word   varchar(50)  NOT NULL,
                       data   int          NOT NULL,
                       CONSTRAINT pk_mbcswords PRIMARY KEY(id)
)
INSERT mbcswords (id, word, data)
   SELECT id, word, data
   FROM   collationbase.dbo.worddata

mbcswords is the actual test table. The collation of word is the database collation, the collation to be tested. The code page for Slovenian is 1250, so when the test database has a collation with code page 1250, there is no conversion between code pages. When the code page is 1257, there is a conversion, but since all three of č, š and ž are also present in CP1257, the Slovenian words are inserted unmolested. It's a different thing when the code page is for a non-Latin script. In this case, the insertion results in all three of č, š and ž being replaced by fallback characters in the ASCII range.
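Here is a small sketch to illustrate the fallback (the expected results in the comments are my assumption, following the reasoning above): with a collation for code page 1250, the three letters survive the conversion to varchar, but with a collation for a non-Latin script, such as Greek (code page 1253), they should come back as plain c, s and z.

SELECT convert(varchar(10), N'čšž' COLLATE Slovenian_100_CI_AS) AS cp1250,  -- expected: čšž
       convert(varchar(10), N'čšž' COLLATE Greek_100_CI_AS)     AS cp1253   -- expected: csz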

Why would this matter? My assumption is that in the conversion routines there is a fast track for the ASCII range, since there is nothing to convert; only for characters with a code point ≥ 128 is there a need to consult a lookup table. Thus, for these languages, the conversion from the legacy code page to UTF‑16 is faster than for the languages with code pages 1250 and 1257. On the other hand, all UTF‑8 collations work with the same data, since there is no need for fallback characters.

Or, in other words, for collation families with a legacy code page other than 1250 or 1257, the comparison between Collation_SC and Collation_SC_UTF8 is not an apples-to-apples comparison.

Note: For code pages 1252 and 1254, the only missing letter is č. This can explain why the ratio for the Turkish code page is lower. Why Latin‑1 and Arabic have the same average ratio, I don't know.

Thus, the values in this test that we should pay attention to are those for the Eastern European and Baltic code pages. The others are not relevant. These numbers still suggest a penalty for UTF‑8, but they are less alarming than it seemed at first.

There is another and a more important lesson from this observation: the outcome of the test depends a lot on the data. What if the test data had been English words, with very few characters outside ASCII? Or what if the data had been Ukrainian with very few ASCII characters? I don't want to venture to say what the outcome would be in these cases. All I can say is that if you need to know for your data, you need to test yourself. (And when you test yourself, you will not test all 5540 collations, but only the few that are relevant to you.)

Besides the flaw in the test setup, there is a problem with this particular comparison from a different perspective. Is there any relevance in it? Would you and your team find yourself in the situation where you ask yourselves Should we use Slovenian_100_CI_AS_SC with CP1250 or Slovenian_100_CI_AS_SC_UTF8 for our varchar data? Rather, you are likely to have concluded that you need full Unicode support so that you can handle Slovenian, Swedish, Greek and other languages in the same database. So your question would rather be: Should we use varchar or nvarchar?

The next query attempts to answer this question, by comparing the performance for varchar and nvarchar for the UTF‑8 collations. For the reasons I just discussed, the query only looks at collation families for which the legacy code page is 1250 and 1257. For reference, the query includes the same comparison for non-UTF8 collations. We have already looked at such a comparison, but the values are slightly different from before due to the filter on the code page. Here is the query:

; WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.collation, w.utf8
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON w.family = f.family
   WHERE  w.binary IS NULL
     AND  r.test <> 'MISFIT'
     AND  convert(int, collationproperty(f.family + '_100_CI_AS', 'Codepage'))
          IN (1250, 1257)
), dataallover AS (
   SELECT test, utf8,
          nvarchar   = AVG(CASE WHEN datatype = 'nvarchar' THEN optime END),
          varchar    = AVG(CASE WHEN datatype = 'varchar'  THEN optime END),
          v_nv_ratio = AVG(CASE WHEN datatype = 'varchar'  THEN optime END) /
                       AVG(CASE WHEN datatype = 'nvarchar' THEN optime END)
   FROM   selection
   GROUP  BY test, utf8
), percollation AS (
   SELECT test, utf8, collation,
          v_nv_ratio = AVG(CASE WHEN datatype = 'varchar' THEN optime END) /
                       AVG(CASE WHEN datatype = 'nvarchar'  THEN optime END)
   FROM   selection
   GROUP  BY test, utf8, collation
), minmaxvalues AS (
   SELECT test, utf8,
          minvalue = MIN(v_nv_ratio), maxvalue = MAX(v_nv_ratio)
   FROM   percollation
   GROUP  BY test, utf8
)
SELECT a.test, a.utf8,
       varchar              = str(a.varchar, 10, 2),
       nvarchar             = str(a.nvarchar, 10, 2),
       "varchar / nvarchar" = str(a.v_nv_ratio, 7, 2),
       minvalue             = str(m.minvalue, 7, 2),
       maxvalue             = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test = a.test
                     AND m.utf8 = a.utf8
ORDER  BY a.test, a.utf8

As previously when we compared performance by data type, the ranges for the ratios are taken from the individual collations and therefore capture the absolute extreme values. Here are the results:

test      utf8     varchar   nvarchar   varchar / nvarchar  minvalue  maxvalue
HASHGRP            996.40     882.39          1.13            0.98      1.21
HASHGRP   UTF8    1041.24     922.96          1.13            1.01      1.21
LIKE              3279.66    3247.74          1.01            0.91      1.15
LIKE      UTF8    3434.06    3397.82          1.01            0.91      1.16
LOOKUP               8.28       7.34          1.13            1.04      1.24
LOOKUP    UTF8       8.81       7.53          1.17            1.10      1.24
LOOPJOIN           825.54     626.96          1.32            1.21      1.52
LOOPJOIN  UTF8     861.71     658.53          1.31            1.22      1.40
ORDERBY           6184.76    4232.25          1.46            1.33      1.83
ORDERBY   UTF8    7289.04    4695.16          1.55            1.45      1.69

We can see what we already knew: there is a cost for using varchar over nvarchar due to the conversion to UTF‑16. The penalty is about the same for legacy code pages and UTF‑8 for four of the tests. For the ORDERBY test, the penalty is yet a little higher for UTF‑8. For some reason, the range is distinctively narrower for the UTF‑8 collations than for the collations with legacy code pages. Why, I don't really know.

But this comparison, too, is somewhat dubious. Would you first settle on Modern_Spanish_100_CI_AS_SC_UTF8 and then decide whether you should go with varchar or nvarchar? Rather, your choices might be these two: 1) Use nvarchar with the version-80 collation Modern_Spanish_CI_AS that you have always used through the years. 2) Switch to Modern_Spanish_100_CI_AS_SC_UTF8 to save space.

Here is a query to make this comparison. I've restricted this comparison to collation families that exist both in version-80 and version-100. On the other hand, I am not filtering on the code page, since this is a comparison between two ways of storing the full range of Unicode. This is the longest query so far:

; WITH family_80_100 AS (
   SELECT family
   FROM   wincollations
   WHERE  version IN ('80', '100')
   GROUP  BY family
   HAVING COUNT(DISTINCT version) = 2
), selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.utf8, w.version
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON f.family = w.family
   WHERE  f.European = 1
     AND  w.binary IS NULL
     AND  r.test <> 'MISFIT'
     AND  EXISTS (SELECT *
                  FROM   family_80_100 f8
                  WHERE  f8.family = w.family)
), dataallover AS (
   SELECT test,
          nvarchar_80  = AVG(CASE WHEN datatype = 'nvarchar' AND 
                                       version = '80' THEN optime END),
          varchar_UTF8 = AVG(CASE WHEN datatype = 'varchar'  AND 
                                       utf8 = 'UTF8' THEN optime END),
          UTF8_to_16   = AVG(CASE WHEN datatype = 'varchar'  AND 
                                       utf8 = 'UTF8' THEN optime END) /
                         AVG(CASE WHEN datatype = 'nvarchar' AND 
                                       version = '80' THEN optime END) 
   FROM   selection
   GROUP  BY test
), perfamily AS (
   SELECT test, family,
          UTF8_to_16   = AVG(CASE WHEN datatype = 'varchar' AND 
                                       utf8 = 'UTF8' THEN optime END)  /
                         AVG(CASE WHEN datatype = 'nvarchar' AND
                                       version = '80' THEN optime END) 
   FROM   selection
   GROUP  BY test, family
), minmaxvalues AS (
   SELECT test, 
          minvalue = MIN(UTF8_to_16), maxvalue = MAX(UTF8_to_16)
   FROM   perfamily
   GROUP  BY test
)
SELECT a.test, 
       varchar_UTF8 = str(a.varchar_UTF8, 10, 2),
       nvarchar_80  = str(a.nvarchar_80,   10, 2),
       UTF8_to_16   = str(a.UTF8_to_16, 7, 2),
       minvalue     = str(m.minvalue, 7, 2),
       maxvalue     = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test = a.test
ORDER  BY a.test

Ranges are again computed on family level, not on collation level.

The results are not very encouraging for UTF‑8:

test      varchar_UTF8  nvarchar_80  UTF8_to_16  minvalue  maxvalue
HASHGRP        999.17       773.82       1.29      1.20      1.44
LIKE          3292.93      2863.14       1.15      1.08      1.25
LOOKUP           8.56         6.82       1.26      1.20      1.35
LOOPJOIN       820.91       539.76       1.52      1.40      1.72
ORDERBY       6669.88      3017.41       2.21      1.92      2.80

Here we get the combination of three performance "evils". First, we get the cost for the richer character support in version-100. Next, we get the cost for converting to UTF‑16. And to add insult to injury, as the previous tests have suggested, there is a higher cost for converting from UTF‑8 than from a legacy code page.

For all tests, the UTF‑8 choice incurs a noticeable performance cost compared to using nvarchar with the old version-80 collation. For the LIKE test it is only 15 % on average, but already for the more common operations of a lookup or a GROUP BY with a hash aggregate, we have an overhead of 25-30 % on average. For a loop join, we see a 50 % increase, and in the worst case more than 70 %. And for a sorting operation, the execution time more than doubles on average, and in the worst case it comes close to tripling.

One consideration here is the substantial space savings you get when your data is in Latin script. My tests do take this into the equation in so far as the tested operations lead to table scans. And despite these scans, nvarchar still beats varchar with UTF‑8 by a wide margin. On the other hand, my tables are narrow. If I were to add twenty more string columns with the same collation and data type, the results could be quite different. Again, you need to test with your workload. With your tables. With your data. And with the collations you are considering. Only a maniac like me would test all collations.

Finally, you may already have decided that you need a version-100 collation and you want full Unicode support, so the choice is between Breton_100_CI_AS and Breton_100_CI_AS_SC_UTF8. For reference, here is a query for this situation:

WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.surrogate, w.utf8
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON f.family = w.family
   WHERE  f.European = 1
     AND  w.binary IS NULL
     AND  r.test <> 'MISFIT'
     AND  w.version = '100'
), dataallover AS (
   SELECT test,
          nvarchar_100 = AVG(CASE WHEN datatype = 'nvarchar' AND 
                                       surrogate = '' THEN optime END),
          varchar_UTF8 = AVG(CASE WHEN datatype = 'varchar'  AND 
                                       utf8 = 'UTF8' THEN optime END),
          UTF8_to_16   = AVG(CASE WHEN datatype = 'varchar'  AND 
                                       utf8 = 'UTF8' THEN optime END) /
                         AVG(CASE WHEN datatype = 'nvarchar' AND 
                                       surrogate = '' THEN optime END) 
   FROM   selection
   GROUP  BY test
), perfamily AS (
   SELECT test, family,
          UTF8_to_16   = AVG(CASE WHEN datatype = 'varchar' AND 
                                       utf8 = 'UTF8' THEN optime END)  /
                         AVG(CASE WHEN datatype = 'nvarchar' AND 
                                       surrogate = '' THEN optime END) 
   FROM   selection
   GROUP  BY test, family
), minmaxvalues AS (
   SELECT test, 
          minvalue = MIN(UTF8_to_16), maxvalue = MAX(UTF8_to_16)
   FROM   perfamily
   GROUP  BY test
)
SELECT a.test, 
       varchar_UTF8 = str(a.varchar_UTF8, 10, 2),
       nvarchar_100 = str(a.nvarchar_100,   10, 2),
       UTF8_to_16   = str(a.UTF8_to_16, 7, 2),
       minvalue     = str(m.minvalue, 7, 2),
       maxvalue     = str(m.maxvalue, 7, 2)
FROM   dataallover a
JOIN   minmaxvalues m ON m.test = a.test
ORDER  BY a.test

Results:

test      varchar_UTF8  nvarchar_100  UTF8_to_16  minvalue  maxvalue
HASHGRP        995.02        872.09       1.14      1.09      1.17
LIKE          3299.76       3254.38       1.01      0.98      1.04
LOOKUP           8.55          7.25       1.18      1.15      1.20
LOOPJOIN       817.38        612.50       1.33      1.28      1.45
ORDERBY       6631.74       4078.04       1.63      1.46      1.95

The UTF‑8 collations do not fare as badly as in the comparison with the version-80 collations, but the performance still lags behind nvarchar and UTF‑16.

There is one thing hiding in this data that turned up when I got curious to see what that max value of 1.95 for ORDERBY might be. This was the query:

WITH selection AS (
   SELECT r.test, r.datatype, r.optime, w.family, w.surrogate, w.utf8
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   JOIN   families f ON f.family = w.family
   WHERE  f.European = 1
     AND  w.binary IS NULL
     AND  r.test = 'ORDERBY'
     AND  w.version = '100'
)
SELECT family,
       UTF8_to_16   = AVG(CASE WHEN datatype = 'varchar' AND 
                                    utf8 = 'UTF8' THEN optime END)  /
                      AVG(CASE WHEN datatype = 'nvarchar' AND 
                                    surrogate = '' THEN optime END) 
FROM   selection
GROUP  BY family
ORDER BY UTF8_to_16 DESC

Here are the top 5:

family             UTF8_to_16
Latin1_General     1.94931150616021
German_PhoneBook   1.75661650233524
Bashkir            1.75617908851282
Frisian            1.75410183330766
Ukrainian          1.75127126744408

Latin1_General seems to suffer a lot more than any other family from the change to UTF‑8. There is a certain reason for this, which we will learn more about in the next section.

Comparing Performance of Collation Families

You may ask: Are some languages slower than others? And the answer to that question is, yes. Not that this is a very interesting question. It's not like someone would go oh, Slovak is so slow, let's use a Frisian collation instead. No way their users would accept that.

So this section is from the lighter side. We will look at what are the fastest and slowest collation families for each test operation. Nevertheless, there is quite an interesting observation hiding here.

To make this investigation, I first computed the average execution time for each test operation by family. I decided to only look at version-100 collations, so that the investigation would not be skewed by some families being version-80 only and having shorter execution times for that reason. This means that the collation families that were renamed in SQL 2008 fell out of the game.

I decided to filter out the UTF‑8 collations to be nice to the collation families with no legacy code page. The only collations with varchar support they have are the UTF‑8 collations, and as we have learnt, UTF‑8 is slower than the legacy code pages in these tests. This resulted in this CTE:

WITH selection AS (
   SELECT r.test, r.datatype, AVG(r.optime) AS optime, w.family
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NULL
     AND  w.version = '100'
     AND  w.utf8 = ''
   GROUP  BY r.test, r.datatype, w.family
)

Next, I computed rankings within each test/data type pair with the row_number function, in both forward and reverse order, so I could find top and bottom families. I also computed how much slower, as a ratio, a family is compared to the top-ranking family. This led to this CTE:

, rankings AS (
   SELECT test, datatype, family,
          rank    = row_number() OVER(PARTITION BY test, datatype 
                                      ORDER BY optime),
          revrank = row_number() OVER(PARTITION BY test, datatype 
                                      ORDER BY optime DESC),
          ratio   = optime / first_value(optime) 
                            OVER (PARTITION BY test, datatype ORDER BY optime)
   FROM  selection
)

In the final query, I determined the top and bottom five families. In order to get a more compact output for the article, I joined the rankings CTE with itself, so that I get the fastest and slowest family on the top line for a test, then the second fastest and second slowest, and so on. For reasons that will become apparent, I'm ordering by data type first. This is the final part of the query:

SELECT a.test, a.datatype, a.family, 
       str(a.ratio, 6, 4), b.family, str(b.ratio, 6, 4)
FROM   rankings a
JOIN   rankings b ON a.test = b.test
                 AND a.datatype = b.datatype
                 AND a.rank = b.revrank
WHERE  a.rank <= 5 OR b.revrank <= 5
ORDER BY a.datatype DESC, a.test, a.ratio

Here is the result of that query:

                              Top 5                                     Bottom 5
Test      Datatype  Family                              Ratio     Family                Ratio
HASHGRP   varchar   Cyrillic_General                    1.0000    Vietnamese            1.2649
HASHGRP   varchar   German_PhoneBook                    1.0008    Latvian               1.1958
HASHGRP   varchar   Bosnian_Cyrillic                    1.0019    Hungarian_Technical   1.1958
HASHGRP   varchar   Georgian_Modern_Sort                1.0044    Albanian              1.1786
HASHGRP   varchar   Greek                               1.0048    Mohawk                1.1763
LIKE      varchar   Cyrillic_General                    1.0000    Hungarian_Technical   1.1518
LIKE      varchar   Hebrew                              1.0023    Vietnamese            1.1473
LIKE      varchar   Georgian_Modern_Sort                1.0023    Upper_Sorbian         1.1514
LIKE      varchar   Latin1_General                      1.0041    Latvian               1.1451
LIKE      varchar   Chinese_Traditional_Stroke_Count    1.0044    Lithuanian            1.1435
LOOKUP    varchar   Georgian_Modern_Sort                1.0000    Vietnamese            1.2126
LOOKUP    varchar   French                              1.0011    Latvian               1.1739
LOOKUP    varchar   Latin1_General                      1.0027    Hungarian_Technical   1.1715
LOOKUP    varchar   German_PhoneBook                    1.0029    Lithuanian            1.1602
LOOKUP    varchar   Frisian                             1.0075    Slovak                1.1416
LOOPJOIN  varchar   Bosnian_Cyrillic                    1.0000    Vietnamese            1.3575
LOOPJOIN  varchar   Bashkir                             1.0015    Hungarian_Technical   1.3214
LOOPJOIN  varchar   Kazakh                              1.0036    Latvian               1.3208
LOOPJOIN  varchar   Latin1_General                      1.0043    Slovak                1.2907
LOOPJOIN  varchar   Ukrainian                           1.0065    Czech                 1.2797
MISFIT    varchar   Georgian_Modern_Sort                1.0000    Vietnamese            1.2241
MISFIT    varchar   Frisian                             1.0052    Hungarian_Technical   1.2133
MISFIT    varchar   French                              1.0067    Slovak                1.1677
MISFIT    varchar   Latin1_General                      1.0075    Albanian              1.1658
MISFIT    varchar   German_PhoneBook                    1.0175    Mohawk                1.1631
ORDERBY   varchar   Cyrillic_General                    1.0000    Vietnamese            1.8252
ORDERBY   varchar   Bosnian_Cyrillic                    1.0001    Hungarian_Technical   1.7386
ORDERBY   varchar   Tatar                               1.0001    Latvian               1.6902
ORDERBY   varchar   Ukrainian                           1.0003    Slovak                1.6442
ORDERBY   varchar   Kazakh                              1.0011    Czech                 1.6064
HASHGRP   nvarchar  Lao                                 1.0000    Latvian               1.1943
HASHGRP   nvarchar  Syriac                              1.0000    Vietnamese            1.2770
HASHGRP   nvarchar  Pashto                              1.0023    Mohawk                1.1921
HASHGRP   nvarchar  Ukrainian                           1.0027    Hungarian_Technical   1.1913
HASHGRP   nvarchar  Arabic                              1.0031    Czech                 1.1721
LIKE      nvarchar  French                              1.0000    Upper_Sorbian         1.1593
LIKE      nvarchar  Cyrillic_General                    1.0009    Hungarian_Technical   1.1466
LIKE      nvarchar  Kazakh                              1.0024    Vietnamese            1.1441
LIKE      nvarchar  Frisian                             1.0036    Lithuanian            1.1407
LIKE      nvarchar  Dari                                1.0048    Croatian              1.1387
LOOKUP    nvarchar  Latin1_General                      1.0000    Vietnamese            1.1899
LOOKUP    nvarchar  Frisian                             1.0116    Hungarian_Technical   1.1780
LOOKUP    nvarchar  Ukrainian                           1.0172    Mohawk                1.1586
LOOKUP    nvarchar  Bashkir                             1.0174    Hungarian             1.1444
LOOKUP    nvarchar  Korean                              1.0176    Latvian               1.1376
LOOPJOIN  nvarchar  Latin1_General                      1.0000    Vietnamese            1.4778
LOOPJOIN  nvarchar  Hebrew                              1.0520    Mohawk                1.3794
LOOPJOIN  nvarchar  Syriac                              1.0560    Hungarian_Technical   1.3699
LOOPJOIN  nvarchar  Frisian                             1.0566    Slovak                1.3356
LOOPJOIN  nvarchar  Bosnian_Cyrillic                    1.0567    Latvian               1.3348
ORDERBY   nvarchar  Latin1_General                      1.0000    Vietnamese            2.1728
ORDERBY   nvarchar  Frisian                             1.1077    Mohawk                1.9324
ORDERBY   nvarchar  German_PhoneBook                    1.1102    Hungarian_Technical   1.9175
ORDERBY   nvarchar  Cyrillic_General                    1.1144    Slovak                1.7951
ORDERBY   nvarchar  Dari                                1.1146    Latvian               1.7585

Let's first look in the Bottom 5 columns. Do you get the feeling that the same languages keep reappearing? Vietnamese, Hungarian, Latvian, Mohawk, Upper Sorbian and a few more. As for why these families are slower, the answer is that they have more complex rules. Vietnamese is famous for being about the only language where a letter can take more than one accent. I don't know about all languages, but Hungarian, Croatian, Slovak and Czech are all languages with digraph letters. (As you may recall from the section Collation Families, in Hungarian cs counts as a letter of its own, and there are a few more of those.) It seems reasonable to assume that the handling of these digraphs is more complex, leading to increased execution time.
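As a quick reminder of what the digraph handling means, here is a sketch with the Hungarian words cukor ('sugar') and csak ('only'). Under the Hungarian collation, cs should count as a letter that follows c, so cukor sorts before csak, whereas Latin1_General compares character by character and puts csak first:

SELECT CASE WHEN N'cukor' < N'csak' COLLATE Hungarian_100_CI_AS
            THEN 'cukor < csak' ELSE 'csak < cukor' END AS Hungarian,
       CASE WHEN N'cukor' < N'csak' COLLATE Latin1_General_100_CI_AS
            THEN 'cukor < csak' ELSE 'csak < cukor' END AS Latin1_General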

When it comes to Hungarian, there is an interesting observation here. As I briefly mentioned earlier, there are two Hungarian collation families, Hungarian and Hungarian_Technical. It's only the Hungarian collation that properly handles the digraphs. Hungarian_Technical simplifies them and handles cs as two separate letters. So it is a little mysterious that Hungarian_Technical is more common among the bottom-5 than the proper Hungarian family. If I change the filter so that the query only spans version-80 collations, the table turns upside down more or less. Hungarian is now the slowest in most comparisons and Hungarian_Technical is nowhere to be seen on the right side. Instead, it makes a couple of appearances on the left side, among the top-fives! You may recall that when we studied the impact of the collation version, Hungarian_Technical was the family with the biggest penalty for version-100 compared to version-80 with a bit of a margin. What they did with the Hungarian collations in SQL 2008, I don't know, but it seems that they did something.

It is also worth paying attention to the actual values in the Ratio column. For the HASHGRP, LIKE, LOOKUP and MISFIT tests, the slowest collation is only 20-25 % slower than the fastest. The difference gets more pronounced for the LOOPJOIN operation, particularly for nvarchar. For the ORDERBY operation, Vietnamese is slightly below half the speed of the fastest collation for nvarchar. Although, with nvarchar, as we will learn shortly, there is a special situation on the left side that makes the ratios on the right side higher than they are for varchar.

Let's now move over to the Top 5 column. For the Bottom 5, we saw the same languages reappearing, but here the top quintets are a lot more diverse. If we look at the actual ratios, we can understand why. The ratios for positions two to five are in most cases very close to 1.0000. In fact, they are so close that there is all reason to say that the differences are not significant, and these rankings are just due to random circumstances during my tests. But wait! This is true for all tests on varchar. It is also true for HASHGRP and LIKE for nvarchar. But then something starts to happen. First only lightly. For LOOKUP, Latin1_General has a small but distinct margin to Frisian in second place. For the LOOPJOIN operation, the pattern is more pronounced. Hebrew at number two after Latin1_General has a ratio of 1.05. And for the ORDERBY operation, Latin1_General takes an even greater lead.

So it seems that Latin1_General comes out as a winner. But only for nvarchar? There is more to it. Consider the query with this definition of the selection CTE, constraining the selection to case-sensitive collations for nvarchar:

WITH selection AS (
   SELECT r.test, r.datatype, AVG(r.optime) AS optime, w.family
   FROM   resultsview r
   JOIN   wincollations w ON w.collation = r.collation
   WHERE  w.binary IS NULL
     AND  w.version = '100'
     AND  w.case_ = 'CS'
     AND  r.datatype = 'nvarchar'
     AND  w.utf8 = ''
   GROUP  BY r.test, r.datatype, w.family

To save some space, I only show the top 3 for nvarchar for each test:

test      family              ratio
HASHGRP   Syriac             1.0000
HASHGRP   Lao                1.0043
HASHGRP   Arabic             1.0049
LIKE      French             1.0000
LIKE      Ukrainian          1.0065
LIKE      Uzbek_Latin        1.0074
LOOKUP    Ukrainian          1.0000
LOOKUP    Cyrillic_General   1.0011
LOOKUP    Uzbek_Latin        1.0017
LOOPJOIN  Latin1_General     1.0000
LOOPJOIN  Hebrew             1.0160
LOOPJOIN  Frisian            1.0195
ORDERBY   Frisian            1.0000
ORDERBY   German_PhoneBook   1.0019
ORDERBY   Cyrillic_General   1.0061

Poof! The superiority of Latin1_General is gone entirely. (Well, almost. There is a slight trace of it in the LOOPJOIN test.)

As you may guess, if I change the filter to only select case-insensitive collations, it's the other way round:

test      family              ratio
HASHGRP   Lao                1.0000
HASHGRP   Kazakh             1.0025
HASHGRP   Ukrainian          1.0026
LIKE      Frisian            1.0000
LIKE      Kazakh             1.0010
LIKE      Bosnian_Cyrillic   1.0016
LOOKUP    Latin1_General     1.0000
LOOKUP    Frisian            1.0280
LOOKUP    Arabic             1.0387
LOOPJOIN  Latin1_General     1.0000
LOOPJOIN  Dari               1.0895
LOOPJOIN  Syriac             1.0905
ORDERBY   Latin1_General     1.0000
ORDERBY   Frisian            1.2544
ORDERBY   German_PhoneBook   1.2575

The supremacy of Latin1_General is now even more pronounced. But only for the LOOKUP, LOOPJOIN and ORDERBY tests. What is going on?

Before I try to answer that question, you may recall that when we looked at case-sensitive vs. case-insensitive, we found that case-sensitive is 25 % slower than case-insensitive for Latin1_General, while for all other families the ratio was never more than 1.03. We can now see that it is not that case-sensitive is unusually slow with Latin1_General; rather, it is case-insensitive that is unusually fast with this family.

I have an idea why we see this effect with only these three operations. Hashing just looks at the character string. LIKE performs a lot of comparisons, but they are all equality operations. On the other hand, the other three also include inequality operations such as < and >. ORDERBY has the most of these since it's sorting the data. But they are also used with the LOOKUP and LOOPJOIN operations when finding data through indexes.

Exactly what is going on, I don't know. Maybe there is a special optimised code path for Latin1_General_CI in nvarchar. After all, this is a very common collation setting, not least if we include SQL_Latin1_General_CP1_CI_AS, which for nvarchar is the same as Latin1_General_CI_AS. It could also be that the starting point for the character tables is geared towards case-insensitive Latin1_General. But if so, shouldn't we see the same result for varchar, since for Windows collations, varchar is converted to UTF‑16 internally? I can only speculate. Maybe the decision to take the special code path is made before the check of whether the data needs to be converted.

In any case, this observation has a bearing on our investigations of UTF‑8 collations. The last observation in the previous section was that the penalty of using varchar with Latin1_General_100_UTF8 over using Latin1_General_100 with nvarchar was considerably higher than for any other collation family. Now we can see that it is not really only a penalty, but with UTF‑8 we are missing out on a bonus. Then again, that does not change the fact that my tests suggest almost a doubling of the execution time with UTF‑8 over nvarchar for Latin1_General, a very commonly used collation family.

You may ask what happens if I change the version filter to only include version-80 collations. The answer is that the results are similar, although the supremacy is somewhat smaller. For the ORDERBY test, the number-two family has a ratio of 1.14. And which family is that, of all families? Hungarian_Technical!

Executive Summary

This section summarises the findings of my performance tests, without giving any details on how I arrived at the conclusions. For those details, you need to read the previous sections.

When comparing varchar and nvarchar (including UTF‑8 collations), the tests take the space reduction with varchar into consideration in so far as the test operations mainly are scans. But the test tables are narrow, and if there were ten more string columns, it would be likely to matter whether they are varchar or nvarchar.

I would like to stress that the observations I've made depend on the test tables and the test data I used. If collation speed is critical to you, you need to run your own tests with your data profile and workload. And only with the collations you are actually considering.

Design Considerations and Final Thoughts

In this final chapter, I will try to bring everything I have discussed in this article together by giving some recommendations and discussing some possible strategies, in addition to the occasional recommendations and preferences I have dropped earlier in this article.

Some things will come out quite naturally and seem to be self-evident. Other things I'm saying are more radical – and I would lie if I said that I live by those radical ideas myself. And don't be confused if I seem to contradict myself in this chapter. As often, the answer when you work with SQL Server is: It depends.

One or Multiple Collations for Your Organisation?

Ideally, you should settle for one single collation for your organisation. This has a lot of advantages:

Unfortunately, this noble goal can be impossible to achieve in total perfection. Particularly if that single collation is anything other than SQL_Latin1_General_CP1_CI_AS and your developers work from LocalDB, which only permits this collation as the server collation. Then again, if they work with that collation, and your corporate collation is another case-insensitive collation, there may not be very many issues, as long as deployment is by scripts. (Some developers think that you should take a copy of their database directly into production. This is bad for many more reasons than just collation issues.) True, if your corporate collation were a case-sensitive collation, there would be a lot of pain when the developers have spelt MyTable in a number of ways. But I don't think that very many organisations would make that choice, since it is likely to result in a poor user experience.

Beyond LocalDB, there are several reasons why a strategy of having a single collation in your organisation could break down:

There may be more things that are all valid reasons why you have multiple collations in your estate. But there is one reason that is not valid, and that is that someone just clicked Next-Next when installing SQL Server without checking the collation tab. Or, for that matter, just accepted the default collation in the Azure Portal without further thought. You should always make a conscious choice about the collation when you install SQL Server or create an SQL Server resource in the cloud.

Given that there is most likely more than one person in your organisation who installs SQL Server instances, some of whom may be juniors, there is all reason to have a written instruction for how to install SQL Server according to your organisation's standards. The collation should be one point in that instruction.
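Such an instruction can also include a quick check to run once the installation is completed, something like this sketch:

-- Verify that the server collation (and the collation of the current database)
-- agrees with the corporate standard.
SELECT serverproperty('Collation')                AS server_collation,
       databasepropertyex(db_name(), 'Collation') AS database_collation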

As noted above, there may be plenty of reasons why you cannot stick to a single collation. However, those vendor databases and other exceptions may be few enough that it still makes sense to have a corporate collation for your organisation.

But it may also be the case that the business of your organisation is such that you will use collations of all sorts. I have a radical idea for this situation: For SQL Server instances intended for development, always use different collations for the server and database level. This will force developers to always use COLLATE DATABASE_DEFAULT in their temp tables, which makes it easy to move databases between instances. Adding the COLLATE clause later to all stored procedures (and other application code) could prove to be prohibitively expensive. Thus, the only chance to get it right is to get it right from the start. Yes, your developers will moan when you introduce it. But they will get used to it. :-)
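That is, on such a dev server, a temp table would be created along these lines (the table and column names are made up for the example):

CREATE TABLE #orders (order_id  int                                    NOT NULL PRIMARY KEY,
                      customer  nvarchar(100) COLLATE DATABASE_DEFAULT NOT NULL)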

Since new databases could be created on a dev server every now and then, it can be a good idea to add a server-level DDL trigger for CREATE DATABASE that checks the collation.
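
A minimal sketch of such a trigger could look like the one below. The trigger name and the corporate collation Latin1_General_100_CI_AS_SC are only placeholders that you would replace with your own choices:

CREATE TRIGGER corporate_collation_check ON ALL SERVER
FOR CREATE_DATABASE
AS
BEGIN
   DECLARE @dbname    sysname,
           @collation sysname

   -- Pick up the name of the database that was just created.
   SELECT @dbname = EVENTDATA().value('(/EVENT_INSTANCE/DatabaseName)[1]', 'nvarchar(128)')
   SELECT @collation = convert(sysname, databasepropertyex(@dbname, 'Collation'))

   -- Undo the CREATE DATABASE if the collation deviates from the corporate standard.
   IF @collation <> 'Latin1_General_100_CI_AS_SC'
   BEGIN
      ROLLBACK TRANSACTION
      RAISERROR('The database %s was created with the collation %s, which is not the corporate standard.',
                16, 1, @dbname, @collation)
   END
END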

I should point out that if you are an ISV selling database applications, you have every reason to embrace this model. As an ISV, you should not force the customer to install your application on a separate instance for the sole reason that your temp tables do not have COLLATE DATABASE_DEFAULT.

Which Collation Should I Choose?

All through this article, I have touched on this subject, but it may be worth doing a recap to make sure that all points are collected in one place. I should stress that the discussion here is about the choice of collation for new servers and databases, not that you should go and change the collation of your existing databases – that is way too expensive.

The first thing to remember is that when you install SQL Server, Microsoft will suggest a default collation from your system locale in Windows. Defaults can cast a magic spell on us, and we may think that the vendor suggests a particular default because they think it is the best for us. Nothing could be more wrong in this case. Earlier, I quoted a sentence or two from the topic Collation and Unicode support in the SQL Server documentation. Here is the entire paragraph (under the header SQL Server collations) from which I took those sentences. Emphasis mine.

During SQL Server setup, the default installation collation setting is determined by the operating system (OS) locale. You can change the server-level collation either during setup or by changing the OS locale before installation. For backward compatibility reasons, the default collation is set to the oldest available version that's associated with each specific locale. Therefore, this isn't always the recommended collation. To take full advantage of SQL Server features, change the default installation settings to use Windows collations. For example, for the OS locale "English (United States)" (code page 1252), the default collation during setup is SQL_Latin1_General_CP1_CI_AS, and it can be changed to its closest Windows collation counterpart, Latin1_General_100_CI_AS_SC.

(One could ask how backwards compatibility can be important here – that's backwards compatibility for those who just click Next-Next or who run unattended install without specifying /SQLCOLLATION, two very bad habits. But that's Microsoft's choice, and I know that Microsoft takes backwards compatibility very seriously.)

So there is every reason to challenge Microsoft's defaults and make your own choices. Then again, if you have a number of instances installed with a certain collation, you will not start installing servers and databases with a different collation on a whim. Nor should you – it should be an informed decision taken by your organisation.
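
If you want to take stock of what you already have before making that decision, the server and database collations are easy to inspect:

-- The server-level collation of this instance.
SELECT serverproperty('Collation') AS server_collation

-- The collation of each database on the instance.
SELECT name, collation_name
FROM   sys.databases
ORDER  BY name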

Next, my recommendation is to stay away from the SQL collations. They are legacy collations, and they have some funny quirks, not least SQL_Latin1_General_CP1_CI_AS, as we have seen. That collation may be the default in a lot of places, but it is not a good choice. As we have learnt, there is a programming accident where an SQL collation will be a performance disaster. Thus, for the rest of this discussion, I will assume that your choice will always be a Windows collation when it comes to user data. (As you may recall, SQL_Latin1_General_CP1_CI_AS is the default catalogue collation in Azure SQL Database. For the particular realm of metadata, I think that this collation is perfectly OK. After all, all metadata is nvarchar.)
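
As a reminder of what that accident looks like: when an indexed varchar column with an SQL collation is compared to an nvarchar value, the column has to be converted to nvarchar, which rules out an index seek. This little sketch (the table is made up for the example) shows the pattern to watch out for:

CREATE TABLE dbo.accident_demo
       (id   int         NOT NULL PRIMARY KEY,
        code varchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL)
CREATE INDEX code_ix ON dbo.accident_demo (code)

DECLARE @code nvarchar(20) = N'ABC'
-- Because @code is nvarchar, the code column is converted to nvarchar. With an SQL
-- collation on the column, that conversion rules out an index seek, so the index on
-- code can only be scanned. With a Windows collation, a seek is still possible.
SELECT id FROM dbo.accident_demo WHERE code = @code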

As we have learnt, a collation name consists of a number of components. The first component is the collation family, and in many cases the choice here is self-evident: You pick a collation from a collation family that fits the language of your organisation. If you are in Slovakia, you are likely to select a Slovak collation, if you are in Morocco, you are likely to go for an Arabic collation and so on. I say "likely" here, because you know the local habits better than I do.

Almost as self-evident is the choice of case-sensitivity. For most production databases, the users will appreciate a case-insensitive collation when they search for things. For development, it may be a different matter. As I discussed earlier, a case-sensitive collation can be better for development databases, particularly for ISVs aiming at an international market. Then again, for an in-house system where the original development database may eventually become the production database, there is little reason to have different collations in development and production.

Accent-sensitivity is an interesting one, since the default in many cases is accent-sensitive. But is this really what users want? They search for resume, but they don't get any hits, because someone actually spelled résumé correctly? Or would Swedish users appreciate searching for Verner and not getting hits for Werner? (A name that can be spelled either way.) I suspect that in many cases, users would prefer accent-insensitive searches, so my recommendation is that you give this a serious thought.
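
A quick way to see the difference between the two flavours (the collation names are only examples):

SELECT CASE WHEN 'resume' = 'résumé' COLLATE Latin1_General_100_CI_AS
            THEN 'Equal' ELSE 'Not equal' END AS accent_sensitive,
       CASE WHEN 'resume' = 'résumé' COLLATE Latin1_General_100_CI_AI
            THEN 'Equal' ELSE 'Not equal' END AS accent_insensitive

The first column should report Not equal and the second Equal, which is exactly the difference your users will notice when they search.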

The other three types of sensitivity – kana, width and variation selectors – mainly concern Japanese. It would be foolish of me to give advice here, as my knowledge of Japanese is poor. (But I can't escape the personal reflection that the situations where width-sensitivity is desired must be very specialised.)

Left to discuss are code pages (including UTF‑8), version numbers and surrogate support. They are more technical in nature, and if you ask the business users in these technical terms you will get blank stares back. Unfortunately, though, if you make the wrong choice, you are likely to hear from them later when things do not work as they expect.

Let's first look at the version numbers. For many national languages that use the Latin script, you are likely to get by with a version-80 collation, and the same is also true for several languages using the Cyrillic script as well as Arabic, Greek, Hebrew, Armenian, Georgian, Thai and Korean. As for Chinese and Japanese, I will come back to those a little later. When I say get by, I mean that things will work normally most of the time, but there could be occasional issues. The characters that are most likely to cause problems are emojis and other funny characters like 🎉 (example from an email that I got the day I wrote this sentence). They will display correctly, but when users try to search for them in queries, they will not find them, because with these old collations, SQL Server has zero understanding of the bytes that represent these characters.

Let it be said: To a large extent, the version-80 collations are also legacy collations, just like the SQL collations. And for languages that do not fit into the previous paragraph, absolutely do not use a version-80 collation! One reason may be immediately apparent: for many of these languages there simply are no version-80 collations, only version-100 collations (and maybe also version-90 collations). But there is another reason, which we have covered in this article. To wit, the older the collation, the more undefined code points, and when you have undefined code points in your data, strange things can happen, as I showed you in the section The Version Number: replace and charindex do not find these characters, they are ignored in comparisons, and so on. The languages affected include many of the languages of India, languages spoken in Ethiopia like Amharic and Tigrinya, and quite a few more.

And, no, this is not something that only concerns languages that use other scripts than Latin and Cyrillic. A fair share of Latin letters were added with the version-90 and version-100 collations. For instance, some of the characters used in Pinyin – the official way to write Chinese with Latin letters – are undefined in version-80 and version-90 collations. For widely spoken languages that use Latin letters as their regular script, you can assume full support from the version-80 collations. But if you need to cover text in a language with a smaller number of speakers, don't count on it. When in doubt, go for a version-100 collation. Yes, as noted in the previous chapter, there is a small performance penalty for using a version-100 collation over an older one. But I'm inclined to say that there are few systems where this would be much of an issue.

The next question is: surrogate support or not? That is, an _SC collation or a collation without _SC? This is an easy one. If you have decided to use a version-90 or version-100 collation, there is no reason not to use a surrogate-compatible collation. If my performance tests are to be believed, there is no performance cost at all for _SC collations over plain collations. And as noted above, the characters you are most likely to see in a "Western" system that the older collations cannot handle are emojis and similar characters beyond the Unicode base plane. To have full support for these, you need to have a surrogate-compatible collation.
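
If you want to see the effect on your own instance, the len function is an easy way to observe the difference, since it looks at the collation of its input:

SELECT len(N'🎉' COLLATE Latin1_General_100_CI_AS)    AS without_SC,
       len(N'🎉' COLLATE Latin1_General_100_CI_AS_SC) AS with_SC

Without _SC, the emoji is counted as two UTF‑16 code units, but with _SC it is counted as what it is: one character.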

A special case here are the ideographic characters, used by Chinese, Japanese and to some extent Korean. My understanding is that the Unicode base plane includes all the characters you need for everyday Chinese and Japanese. But then there is a full plane of supplementary characters that is reserved for "extended" ideographic characters and a tertiary plane with even more ideographic characters. My knowledge of Chinese and Japanese is utterly thin, but I have a feeling that even if these characters in the supplementary planes are more on the exotic side, they are likely to appear often enough to give you problems if you are using an old collation that cannot handle them. So for these languages you should at minimum use a version-100 collation.

At minimum? Yes. Recall that for Japanese there are the version-140 collations, and I can't see any reason why you should not use them for Japanese. For Chinese there are the version-160 collations, but they are currently only available in Azure SQL Database and Azure SQL Managed Instance. Also, they are restricted to the Chinese_Simplified collation families, so they don't apply to Taiwan and other places which use Chinese Traditional. But if the version-160 collations are available and applicable to you, I can only see them as your best choice.

The last property of the collation to consider is the code page. Should you use a legacy code page (like CP932, CP1250, CP1252 and so on) or a UTF‑8 collation? If it is only a matter of a single application, and you have already decided to use nvarchar for your Unicode data, my tests suggest that a legacy code page will give you somewhat better performance for the data you store as varchar. (Exactly what data you would store that way is something I will elaborate on in the next section.) However, you should not make collation choices on a per-application basis, but rather settle on a corporate policy. And in that perspective, I absolutely recommend going for a UTF‑8 collation, so that you have the choice per application whether to store Unicode data in varchar or nvarchar.

To conclude this section: My recommendation is that you use the collation Family_100_CI_AI_SC_UTF8, where Family is the collation family that is best suited for your data. I trust you to be better informed to make that decision than I am. (As noted above, this recommendation is modified for Japanese and Chinese with regards to the version number.) I want to stress that this is a recommendation for new servers, databases and tables. For existing databases, keep the collation you have, unless you are running into real problems. And even then you are likely to think twice, given the effort it takes to change.

If you are in the situation that your organisation long since settled on, say, Modern_Spanish_CI_AS as your single collation, you have my sympathy if you prefer to continue to use that collation for new databases. But beware that the day you get an error report of emojis "not working" may be just around the corner.

The Choice of Data Types – varchar or nvarchar?

In this section, I would like to discuss which data types we should use for our string columns.

When it comes to string columns in databases, we can with a bit of generalisation divide them into three groups:

  1. Names. For instance: first/last names of persons, city names, product names, team names – you name it!
  2. Free text. Examples include user-entered comments and notes, product descriptions, error messages in a log table and a lot more.
  3. Codes. Currency codes, state codes etc.

When it comes to the first two, name columns are shorter and are sometimes indexed. Free-text columns tend to be longer and are rarely indexed. However, for this particular discussion they are more or less the same.

For names and free-text columns I have a very strong opinion: They should be stored with a data type that supports the full Unicode range. That is, the data type should be nvarchar or varchar with a UTF‑8 collation. It should never be varchar with a legacy code page. I can hear you say But our system will only run in our country. Please, this is the third decade of the third millennium, and we live in a global world. You may be located in the heart of France, and think that French is all you need to support. But one day your boss says We have been acquired by a company in Slovakia or We have acquired a company in Slovakia, and then the boss adds Can we handle Slovakian names in our database? If you went with varchar with a legacy code page, your answer will have to be No as your cheeks turn red. The next question is likely to be How much work is it to change that? Your answer can't but be Lots! Even if changing the collation to a UTF‑8 collation is a lot less work than switching to nvarchar, changing the collation of a database is still a substantial undertaking.

You could argue that there are exceptions here, like an internal error-log table, because you will write all messages in English. That may be true, but it is not unlikely that you will want to include names and user input in the error messages for better diagnostics. If you have the data type as varchar(MAX) with a legacy code page, that diagnostics may be less helpful when you only see question marks (the generic fallback character).

As for the choice between nvarchar on the one hand and varchar + UTF‑8 collation on the other for a new database, I have already discussed this in the section about UTF‑8 collations, but just to recap: My preference is for nvarchar, since it is more predictable. nvarchar(30) means 30 characters as long as you keep emojis and other supplementary characters out. (Which makes sense for name columns, less so for free-text. But the max length for free-text columns tends to be generous enough, so the exact limit is not equally critical.) With UTF‑8, varchar(30) is 30 bytes, and exactly how many characters you can fit depends on the data.
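
To illustrate what this means, here is a small example (the name in the string is made up, and the UTF‑8 collation is only an example):

SELECT len(N'Hôtel Sjöström')        AS no_of_chars,
       datalength(N'Hôtel Sjöström') AS nvarchar_bytes,
       datalength(convert(varchar(30),
                  N'Hôtel Sjöström' COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS utf8_bytes

This should report 14 characters, 28 bytes as nvarchar and 17 bytes as UTF‑8: the three accented characters take two bytes each in UTF‑8, so how many characters actually fit into varchar(30) with a UTF‑8 collation depends on the data, just as stated above.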

Furthermore, as we saw in the previous chapter, varchar with UTF‑8 collations seems to perform less well than nvarchar because of the internal conversion to UTF‑16. The counter-argument here is that for text in Latin script, nvarchar takes up more space than UTF‑8. To which I presented the counter-counter-argument that you can achieve the same space reduction with row compression. But in the end, this is definitely a matter of preference. If you settle for varchar + UTF‑8, I will not come after you. If you go for varchar + legacy code page, I will. That is simply not an acceptable choice in my book.

Here is another data point. I run this query in tempdb on a freshly started SQL 2022 instance (so that there are no user-created objects):

SELECT  object_name(object_id), * 
FROM    sys.all_columns  
WHERE   type_name(system_type_id) = 'nvarchar'

It returns over 2500 rows. When I change the type name to varchar, I get back 51 rows. Microsoft obviously thinks that Unicode is a better choice for the system catalogue. (varchar + UTF‑8 is not an option, since the system catalogue was designed long before the UTF‑8 collations appeared on the scene.)

So much for columns for names and free text. Code columns are a completely different matter. Just so that you understand what I'm talking about, let me give some examples of code columns:

This is not an exhaustive list. You can surely think of columns in your database that you would classify as codes, but which do not fit any of the examples in the list above.

There is an interesting observation to make on these codes: Very many of them are restricted to the ASCII character set. In fact, very many of them use only uppercase A to Z and maybe digits. This is particularly true for codes that are part of international standards. With home-brewed codes, characters from the local alphabet may appear. (I will have to plead guilty myself on this point.) But we are still talking about characters that appear in the legacy code page that fits your language.

Thus, for these code columns, there is very rarely any need for full Unicode support. Therefore, varchar is the best choice; using nvarchar would really be a waste of space (unless you apply row compression). And for these codes, it does not matter whether you have a UTF‑8 collation or a collation with a legacy code page.

Let me summarise the recommendations so far:

In the next section, I will introduce ideas that go against these recommendations, and particularly the first point. You may find them useful – but you may also find them way too radical.

Collation for Codes

Is a case- and accent-insensitive collation really the best choice for code columns? Say that you have a table called currencies. The primary key is a char(3) column of which the values are currency codes as defined by ISO 4217. That is, you have values like EUR, USD, GBP, SEK etc. You have a number of tables with foreign-key constraints to this table. In these referring tables, do you really ever want to see values like eur, ÉüŘ and so on? I recall that when I looked in a customer database for the system I was working with, I saw sek, and I thought What? Lower-case or unexpected accents will only cause confusion in such columns.

Note: some readers may object and say I wouldn't do that. My currencies table would have a numeric surrogate key and the code would only appear in the currencies table. It is true that some people do database design that way (although I'm not a fan of it myself), and it is particularly common in data warehouses. If this is your situation, this discussion may not apply to you; it's more geared for designs where character codes are used as keys, primary and foreign.

There is another problem, albeit local to my part of the world. Say that my colleagues and I settle for Finnish_Swedish_100_CI_AI_SC_UTF8 for our system. We find that we need a lookup table to hold US states. However, we would only have space for 49 states. You may recall that in Swedish W is just a fancy way of writing V, so we would need to squeeze Washington (WA) and Virginia (VA) together into a single slot, even though they are on different coasts! (Well, we could claim that we are taking revenge on all those 'mericans who think that Sweden and Switzerland are the same. :-)

No, for codes, you want a case- and accent-sensitive collation. Sure, users may want to enter usd for searches, but in that case apply upper to the input value. So code columns should have a CS_AS collation? Maybe, but maybe not. This is the point where I recall a forum post from a couple of years back. The poster had a number of three-letter codes, and he was puzzled by the fact that the code AAD sorted at the end of the list and not before ABC. I noted that the poster had a Danish name, and I smiled as I posted my answer. In Danish collations, SQL Server equates AA with Å, the last letter in the Danish alphabet. And similar issues apply to other languages that have digraph letters, even though the effect is not always as drastic as in the Danish example.
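
You can easily see the behaviour yourself with a quick test (Danish_Norwegian_CI_AS is one of the Danish collations):

SELECT code
FROM   (VALUES('AAD'), ('ABC'), ('AZZ')) AS T(code)
ORDER  BY code COLLATE Danish_Norwegian_CI_AS

The order comes out as ABC, AZZ, AAD, since AA is treated as Å, which sorts after Z.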

There is a way to avoid such mysteries: use a binary collation. That will give the expected order for code columns in most cases. The exception could be if you have codes defined from your native language. Say that in a Turkish system there is a table with codes like OHM, ÖZS, PHI, ZLD. The order given is the expected one for a Turkish user, since Ö sorts between O and P in Turkish, but with a binary collation ÖZS will sort last. So in this example, Turkish_CS_AS would be the better choice.

But for the case where the codes are all ASCII, I'd say that binary collations are the best choice. Apart from being case- and accent-sensitive and sorting without regard to digraph letters, there is one more advantage: you get a small performance boost, as string operations on these columns will be faster. To simplify the discussion, I will henceforth only talk about binary collations for codes.
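
To make it concrete, here is a sketch of what the currencies table from the beginning of this section could look like with this approach, together with one of the referring tables. (The binary collation here is only one possible choice; more about that choice below.)

CREATE TABLE dbo.currencies
       (currencycode char(3)      COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY,
        currencyname nvarchar(60) NOT NULL)

CREATE TABLE dbo.payments
       (paymentid    int            NOT NULL PRIMARY KEY,
        amount       decimal(12, 2) NOT NULL,
        currencycode char(3)        COLLATE Latin1_General_100_BIN2 NOT NULL
           REFERENCES dbo.currencies (currencycode))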

If you start to adopt this approach, it is likely that you will find it a little tedious. Not only do you need to specify the COLLATE clause in the CREATE TABLE statements, but you will also need to use it in your temp-table and table-variable definitions to avoid collation conflicts, as these codes tend to creep into temp tables and table variables for intermediate results that you use in further joins. It is true that I have already recommended using COLLATE DATABASE_DEFAULT for temp tables, unless you really can assume a single collation across the organisation. But: 1) That only covers temp tables, not table variables. 2) You may still find it easier to type COLLATE DATABASE_DEFAULT than COLLATE whichonewasitagain_BIN2.

Let's take it one step further. Who says that the collation of the name and free-text columns should dictate the database collation? Maybe it is the other way round: the database collation should be the collation for codes. That is, the database collation should be binary. At first glance it seems that we are just moving the problem from the code columns to the other columns, but I would suggest that it is not really so. Yes, in the CREATE TABLE statements for the tables themselves, we certainly are. But not so much in temp tables and table variables, because there are far fewer joins over names and free-text columns, if any at all. When these columns appear in temp tables, it is often only as carry-on from one intermediate result to the next. You may want to sort by a name column in the final SELECT, and in that case you will need a COLLATE clause to get the correct order, but that might be the only place in the procedure where you need it.

Keep in mind that you will still need COLLATE DATABASE_DEFAULT for your code columns in temp tables, unless the server collation is also binary (and that would be a company-wide server collation). However, you are relieved from it in table variables.

I should immediately say that this is not always the right way to go; it is one of those "it depends" situations. First of all, how many of these codes do you have? If you only have a handful, you are probably better off staying with the non-binary collation as the database collation. My experience, though, is that the code columns outnumber the name and free-text columns by quite a margin, since they are often used as foreign keys and appear in many tables.

Another aspect is: how many production databases are there for the system? If it is an in-house system with a single production database, there is not much of a problem with hard-coding a collation for names and free-text. It gets a little problematic if there are multiple installations with different language preferences (for instance because you are an ISV). I would say that it is not an insurmountable roadblock, as you could have a find/replace step in your deployment package to replace the collation in the script with the one preferred at the local site. On the other hand, if you decide to keep your collation for names and free-text as the database collation, and use a binary collation for your codes, this issue is not equally pressing, since all binary collations largely behave the same, so you could consider using the same binary collation for everyone. I will come back to that in a second.

One effect of going for a binary database collation is that your identifiers will be case-sensitive. Personally, I have worked for so many years in case-sensitive environments that this is my preference. Also, I have presented a recommendation for ISVs to develop with a case-sensitive catalogue collation. Nevertheless, I realise that others have different preferences and may find this to be a major obstacle. If that is the case, you can consider using the CATALOG_COLLATION option to get SQL_Latin1_General_CP1_CI_AS as your metadata collation, despite my recommendation against it.
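
For reference, the syntax is along these lines (the database name is made up, and keep in mind that the option is not supported on all versions and platforms, so check the documentation for your environment):

CREATE DATABASE OurDB COLLATE Latin1_General_100_BIN2_UTF8
   WITH CATALOG_COLLATION = SQL_Latin1_General_CP1_CI_AS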

So far I have only said "binary collation", without any discussion on which one to use. But as I hinted above, they are largely the same. And if your codes are ASCII only, they are even more the same. Still, I would say that if you are settling for a binary database collation, the best choice is Latin1_General_100_BIN2_UTF8, as it is "future-proof". If someone gets the idea to use Greek letters for codes, you are all set up for it.

On the other hand, if you have decided that you will stay with your regular database collation and use the COLLATE clause for your code columns, typing Latin1_General_100_BIN2_UTF8 again and again, and reading it again and again is a little tiring. In this case, you may prefer to use Thai_BIN as your binary collation, since that is as short as the name of a binary collation can be.

There are some considerations though:

Lastly, I should say that the division into three types of string columns that I introduced is a simplification. The division is not always clear-cut, and there are columns that straddle the boundaries. As one example, take email addresses. Email addresses are often all lowercase, and if you store them this way, you can easily handle them as binary. But if users want to see addresses as Firstname.Lastname@company.com with mixed case, I don't think a binary collation will work well. In that case, it is better to consider them to be names and store them with a case-insensitive collation. In your database design, you will encounter several such situations, and you will have to make a decision for such data entities on a case-by-case basis.

The ideas put forth in this section certainly are radical and may not be for everyone. The reader may ask if I have implemented these ideas in practice myself. My answer will be: <COUGH> I have not</COUGH>. But if I would ever be in the position to design a new system from scratch, I would absolutely consider this approach.

Conclusion, Acknowledgements and Feedback

This article has been intended both for divertissement and for learning. The divertissement is all the interesting examples from various languages. As for the learnings, here is a brief summary of the most important points in this article that I don't want you to forget.

If you have run the many demos in this article and want to get rid of all the databases named for their database collation, you can use the script drop_demo_databases.sql in the archive collationscripts.zip. It drops all databases where the name and the database collation are the same.

Usually, at this point in my articles, I list people who have provided valuable comments or input in any other form. Because this is the first edition of the article, that list is still empty. However, I need to mention Wikipedia, which I have used to research things I did not know all about myself. This includes many fine details on Unicode, the story about variation selectors, details about languages I have no knowledge of myself, and more.

If you have questions or comments related to this article, feel free to drop me a line on esquel@sommarskog.se. I very much welcome suggestions for improvements to the article, and that very much includes pointing out spelling and grammar errors. Since I don't write for a publisher, I have to rely on my readers for proof-reading. Also, if I say something wrong about a language of which you are a speaker – don't hesitate to point that out to me. I will be more than happy to correct it.

Revision History

2025-02-26
First version.

Back to my home page.