Proper constraints and data validation are a cornerstone of robust applications. They enforce explicit contracts between your database and the applications that rely on it, reducing the number of contingencies your application code must account for.
In this article, I address a specific variant of the uniqueness constraint that is a little tricky to implement in Microsoft’s SQL Server, which has likely discouraged developers from applying this type of validation even when the logical data model calls for it. I’m not aware of a standardized term, so let’s just call it a “Null Agnostic Unique Constraint.”
Null Agnostic Unique Constraint
A constraint that doesn’t allow duplicate values with the twist that Null values are not considered duplicates of other Null values.
A practical example
The following table design is a good example of a situation where you might need a null agnostic unique constraint.
| Column | Attributes |
|---|---|
| CustID | integer, Primary Key, Identity |
| SSN | varchar(11), allows NULL |
| CustName | varchar(100), NOT NULL |
Validation Requirement: For privacy reasons, not every customer is required to provide their SSN, but the table should reject any duplicate values among customers with this field populated.
The core issue behind dealing with nulls as part of a unique index is the ambiguity about what the expression NULL = NULL should evaluate to. In database systems, NULL represents a value that is unknown or undefined. Is one unknown the same as another? That is a matter of debate.
Microsoft, through the SET ANSI_NULLS option (which defaults to ON), takes the seemingly more logical position that you can’t say whether two unknown values are equivalent: by default, NULL = NULL evaluates to UNKNOWN.
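You can see the comparison semantics directly in a query window (a sketch; SET ANSI_NULLS OFF is deprecated and shown only for contrast):

```sql
-- With ANSI_NULLS ON (the default), NULL = NULL evaluates to UNKNOWN,
-- so the CASE expression falls through to the ELSE branch.
SET ANSI_NULLS ON;
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not comparable' END;

-- With ANSI_NULLS OFF (deprecated), a comparison where both operands
-- are NULL evaluates to TRUE, so the first branch is taken.
SET ANSI_NULLS OFF;
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not comparable' END;
```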
The SQL-92 standard takes the same view of comparisons: NULL = NULL and NULL <> NULL both evaluate to unknown, so neither is ever true. The standard goes further in its prescribed treatment of NULL values in the implementation of unique constraints:
If columns with UNIQUE constraints do not also have NOT NULL constraints, then the columns may contain any number of NULL values.
Unfortunately, Microsoft didn’t bring SQL Server fully in line with this standard in its implementation of unique constraints and indexes. Regardless of the ANSI_NULLS setting, SQL Server treats NULL as a discrete value when checking for duplicates; that is, it allows at most a single NULL value in a unique column.
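The default behavior is easy to demonstrate with a scratch table (a sketch; the table and index names are made up):

```sql
-- A plain unique index on a nullable column: SQL Server treats NULL
-- as a value, so only one NULL row is allowed.
CREATE TABLE #NullDemo (SSN varchar(11) NULL);
CREATE UNIQUE NONCLUSTERED INDEX UX_NullDemo_SSN ON #NullDemo (SSN);

INSERT INTO #NullDemo (SSN) VALUES ('111-22-3333'); -- succeeds
INSERT INTO #NullDemo (SSN) VALUES (NULL);          -- succeeds: first NULL
INSERT INTO #NullDemo (SSN) VALUES (NULL);          -- fails: duplicate key error
```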
Oddly enough, this implementation seems to imply that NULL = NULL evaluates to TRUE. It is also a tad befuddling that Microsoft implemented the ability to allow NULL values in unique columns, which is optional in the standard, yet ignored the hard requirement not to treat multiple NULL values as duplicates.
The bottom line is that implementing a null agnostic unique constraint on our Customers table won’t be quite as easy as it would be in PostgreSQL or MySQL, which handle duplicate checking on NULLs in accordance with SQL-92.
Solution 1: Filtered Indexes
The first approach is clearly the cleanest and the most performant, but it requires filtered indexes, which were not introduced until SQL Server 2008.
Under this approach, you simply add a unique index on the column that needs to be unique and, using the WHERE clause of the CREATE INDEX command, specify that only rows where that column is not null are included in the index.
Here’s how it would work using the Customers table example.
```sql
CREATE UNIQUE NONCLUSTERED INDEX [UniqueExceptNulls]
ON [Customers] (SSN)
WHERE [SSN] IS NOT NULL;
```
Any values inserted or updated into the SSN column will be checked for duplicates unless they are NULL, because NULL values are excluded from the index.
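A few sample inserts show the effect (the names and SSNs here are made up for illustration):

```sql
INSERT INTO [Customers] (SSN, CustName) VALUES ('111-22-3333', 'Alice'); -- succeeds
INSERT INTO [Customers] (SSN, CustName) VALUES (NULL, 'Bob');            -- succeeds
INSERT INTO [Customers] (SSN, CustName) VALUES (NULL, 'Carol');          -- succeeds: NULLs bypass the index
INSERT INTO [Customers] (SSN, CustName) VALUES ('111-22-3333', 'Dan');   -- fails: duplicate SSN
```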
Solution 2: Constraint on computed column
If you are using a pre-2008 version of SQL Server, it isn’t quite so easy, but here is a passable workaround. It should perform reasonably well, but it has the downside of cluttering the table with an extra computed column that won’t make much sense to anyone else who looks at the table.
Here, we create a unique constraint on a computed column that forces the NULL values to look unique. The column evaluates to the value being checked for uniqueness unless that value is NULL, in which case it evaluates to a value unique to that row, which won’t trip the duplicate check. The easiest way is to base the computed column on an identity column.
Here is the implementation of this approach, again using the Customers table.
```sql
CREATE TABLE [Customers] (
    [CustId]   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [SSN]      varchar(11) NULL,
    [CustName] varchar(100) NOT NULL,
    [SSN_UNQ]  AS (CASE WHEN [SSN] IS NULL
                        THEN cast([CustID] as varchar(12))
                        ELSE '~' + [SSN]
                   END),
    CONSTRAINT [UQSSN_Unique] UNIQUE ([SSN_UNQ])
);
```
As you can see, the computed column will contain the SSN unless that column is NULL, in which case it will contain the CustID.
The ~ character is prepended to the SSN just to cover the remote possibility that a CustID might match an existing SSN.
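To make the mechanics concrete, here is what the computed column would contain for a few sample rows (the CustID values, names, and SSNs are assumed for illustration):

```sql
-- CustID | SSN         | SSN_UNQ      | result
-- -------+-------------+--------------+---------------------------
-- 1      | 111-22-3333 | ~111-22-3333 | duplicate-checked normally
-- 2      | NULL        | 2            | unique per row via CustID
-- 3      | NULL        | 3            | no collision with row 2
INSERT INTO [Customers] (SSN, CustName) VALUES ('111-22-3333', 'Alice'); -- succeeds
INSERT INTO [Customers] (SSN, CustName) VALUES (NULL, 'Bob');            -- succeeds
INSERT INTO [Customers] (SSN, CustName) VALUES (NULL, 'Carol');          -- succeeds
INSERT INTO [Customers] (SSN, CustName) VALUES ('111-22-3333', 'Dan');   -- fails: SSN_UNQ duplicate
```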
Note: If you are using at least SQL 2005, you might be able to squeeze some extra performance out of this by marking the computed column PERSISTED.
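With that in mind, the column definition would change only by a trailing keyword (a sketch; the CASE expression is deterministic, which PERSISTED requires):

```sql
-- Same computed column as above, with PERSISTED so the value is
-- stored on disk rather than recomputed on each read.
[SSN_UNQ] AS (CASE WHEN [SSN] IS NULL
                   THEN cast([CustID] as varchar(12))
                   ELSE '~' + [SSN]
              END) PERSISTED,
```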
Solution 3: UDF Check Constraint (Not recommended)
I’m mentioning this for the sake of completeness, but with the caveat that this approach could create considerable performance issues if used on any table with a significant amount of INSERT or UPDATE traffic.
Typically, a check constraint is used for simple validation based only on values in the row being inserted or updated, but there is a loophole that lets you access other rows in the table: calling a user-defined function (UDF) from the check constraint.
To implement this approach, you would write a UDF that checks the table for duplicates of a specified value, excluding, of course, the current row being updated.
There might be situations where this is preferable to the computed column constraint method, but I can’t think of one offhand. Use this approach at your own peril.
[Implementation omitted to discourage its use.]