Replacing multiple spaces in a string with a single space in SQL

After reading Jeff Moden’s article “Replace Multiple Spaces with One” describing an approach to collapse white space in a string, I was troubled by the need for a temporary placeholder. I’m generally skeptical of any solution that requires picking a character that doesn’t naturally occur in the data. It just feels like you are building a time-bomb into the app even if you are very careful to pick something so zany it has a very little chance of showing up and causing problems. Also, the character you pick depends on the data you are running this against, so it might not make for a great generic solution.

So today’s project was to find another way to skin that cat without  inserting bogus characters into the data.

Statement of The Problem

For those who haven’t read Jeff’s article, here is a basic statement of the task:

Replace any continuous series of repeating spaces of in a database column with a single space.

Constraints:

  1. Although a CLR UDF using .NET’s regular expression library is the most straightforward way to do this, the original article went for a pure SQL approach so I did the same.
  2. My solution is based language features available in MS SQL Server 2005 and later.

Test Data

I used a permanent table to test this instead of a temporary table, but used the same sample data set from Jeff’s article.  Also, I inserted these same rows 100K times so I had a large enough data set to be useful for performance testing.


CREATE TABLE SpacesTest(
ID int IDENTITY(1,1) NOT NULL,
SomeText varchar(max) NULL,
)

go
DECLARE @loopCounter int
set @test=0

WHILE (@loopCounter<100000)
BEGIN
  INSERT INTO spacestest (sometext)
    SELECT '  This      has multiple   unknown                 spaces in        it.   '
    UNION ALL SELECT 'So                     does                      this!'
    UNION ALL SELECT 'As                                does                        this'
    UNION ALL SELECT 'This, that, and the other  thing.'
    UNION ALL SELECT 'This needs no repair.'
END

Solutions

If we could assume that no series of spaces in the string consisted of more than two spaces, this challenge could be met with a simple application of the REPLACE statement to replace all double spaces with single spaces. However, given series of arbitrary length you need would need to repeat this operation until you had collapsed the longer series down to single spaces.

You could do it iteratively like this:

WHILE EXISTS(SELECT * FROM SpacesTest WHERE (SomeText like '%  %'))
BEGIN
 UPDATE SpacesTest
 SET SomeText=REPLACE(SomeText,'  ',' ')
 WHERE (SomeText like '%  %')
END

While the iterative approach works just fine, it does do of repetitive updates to each row and requires you to write the intermediate output to a temporary table if you just want to query the data rather than updating the source table.  I came up with the following recursive solution that doesn’t require the intermediate updates and is more of  “set based” approach.

Recursive Solution:

WITH spacesCollapsed (ID,SomeText) AS
(
   SELECT ID, cast(LTRIM(RTRIM(SomeText)) as varchar(max)) FROM SpacesTest
   UNION ALL
   SELECT ID, cast(REPLACE(SomeText,'  ',' ') as varchar(max)) as SomeText FROM spacesCollapsed WHERE (SomeText like '%  %')
)
SELECT ID, SomeText
FROM (
   SELECT ID,SomeText,  ROW_NUMBER() OVER(PARTITION BY ID ORDER BY min(len(sometext))) as GroupRowNum
   FROM spacesCollapsed
   GROUP BY ID,SomeText
 ) as reductions
WHERE reductions.GroupRowNum=1;

Conclusions

My new recursive approach did generate valid output and does solve the dilemma of choosing a character that won’t occur in the data. The bad news is that it didn’t perform all that well. In fact, it was over 26x slower than Jeff’s approach. The iterative approach performed a lot better, but was still almost 6x slower than the benchmark. I suspect the performance problems with both of my alternatives come from the need to make multiple passes at the source data using a non-SARGable expression in the where clause.

Query Stats for Jeff's Approach

Jeff's Approach (for comparison)

Query Stats for Recursive Approach

Recursive Approach (Ugh!)

So, although it was a fun exercise, I have to concede this one. If performance is a major consideration (and when isn’t it?) then Jeff actually has a superior technique. Here is an example query using that technique, just be careful to understand the potential side-effects if you use it.

 SELECT REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(sometext)),'  ',' '+ CHAR(7)) , CHAR(7)+' ',''), CHAR(7),'') AS CleanString
 FROM SpacesTest
 WHERE CHARINDEX('  ',sometext) > 0

Feedback

Given how dramatic the difference is for a similar operation, I have to think that maybe I am missing an optimization on my recursive solution that could reign in the performance. If you see anything I’m missing that might help my solution in terms of performance please let me know in the comments.

Creating a unique constraint that ignores nulls in SQL Server

Proper constraints and data validation are the cornerstone of robust applications. They enforce explicit contracts between your database and the applications that rely on them reducing the number of contingencies that your application code needs to account for.

In this article, I address a specific variant of the uniqueness constraint that is a little tricky to implement in Microsoft’s SQL Server which has likely discouraged developers from applying this type of validation even when it is indicated in the logical data  model. I’m not aware of a standardized term, so let’s just call it a “Null agnostic Unique  Constraint.”

Null Agnostic Unique Constraint
A constraint that doesn’t allow duplicate values with the twist that Null values are not considered duplicates of other Null values.

A practical example
This table/design goal is a good  example of a situation where you might need a null agnostic unique constraint.

Customers Table
CustID integer, Primary Key, Identity
SSN varChar(11)
CustName varChar(100)

Validation Requirement: For privacy reasons, not every customer is required to provide their SSN, but the table should reject any duplicate values among customers with this field populated.

The Problem

The core issue behind dealing with nulls as part of a unique index is the ambiguity about what the expression NULL=NULL should evaluate to. In database systems NULL is understood to represent a value that is unknown or undefined. Is one unknown the same as another? That is a matter of debate.

Microsoft, by implementing the SET ANSI_NULLS option and defaulting it to OFF, seems to be taking the seemingly more logical position that you can’t say whether two unknown values are equivalent, and by default evaluates NULL=NULL to NULL.

ANSI, however, takes a contrary view in the SQL-92 standard, in which they specify that both NULL=NULL and NULL<>NULL should evaluate to false. This is further codified in their prescribed approach to the treatment of NULL values in the implementation of unique constraints.

If columns with UNIQUE constraints do not also have NOT NULL constraints, then the columns may contain any number of  NULL values.

Unfortunately, Microsoft didn’t true up SQL Server completely to this standard in their implementation of unique constraints/indexes. Regardless of the ANSI_NULLS setting, SQL Server treats Null as a discrete value when checking for duplicates. That is, it  allows at most a single Null value in a unique column.

Oddly, enough this implementation seems to imply the assumption that NULL=NULL evaluates to True. Further, it is a tad befuddling that MS decided to implement ability to allow NULL values in unique columns, which is optional in the standard, but ignored the hard requirement to not treat multiple NULL values as duplicates.

The bottom line  is that implementing a null agnostic unique constraint on our Customers table won’t be quite as easy as it might be with PostgreSQL or MySQL, which handle duplicate checking on NULLS in accordance with SQL-92.

Solution 1: Filtered Indexes

The first approach is clearly the cleanest and the most performant, but requires the use of filtered indexes which were not introduced until SQL2008.

Under this approach you simply add a unique index to the table on the field that you need to be unique, and specify that you only want to include rows in the index where that field is not null using the new WHERE syntax of the CREATE INDEX command.

Here’s how it would work using the Customers table example.

CREATE UNIQUE NONCLUSTERED INDEX [UniqueExceptNulls]
ON [Customers] (SSN)
WHERE [SSN] IS NOT NULL

Any values inserted or updated into the SSN column will be checked for duplicates unless they are NULL, because NULL values are excluded from the index.

Easy, huh?

Solution 2: Constraint on computed column

If you are using a pre-2008 version of SQL Server, it isn’t quite so easy, but here is an alternate approach that is a passable workaround. It should perform reasonably well, but has the downside of adding the clutter of an extra computed column that won’t make much sense to anyone else that looks at the table.

Here, we create a unique constraint on a computed field that forces the null values to look unique. It does this by evaluating to the value you want to check for uniqueness unless that value is a NULL, in which case it evaluates to a unique value to that row that won’t trip the duplicate check. The easiest way is to base the computed field on an identity field.

Here is the implementation of this approach, again using the Customers table.

CREATE TABLE [Customers] (
  [CustId] INT IDENTITY(1,1) NOT NULL,
  [SSN] varchar(11) NULL,
  [CustName] varchar(100) NOT NULL,
  [SSN_UNQ]  AS (CASE WHEN [SSN] IS NULL
                      THEN cast([CustID] as varchar(12))
                      ELSE '~' + [SSN] END
),
CONSTRAINT [UQSSN_Unique] UNIQUE  ([SSN_UNQ])
)

As you can see, the computed field will contain SSN, unless that field is NULL, in which case it will contain the CustID.

The ~ character is prepended to the SSN just to cover the remote possibility that a CustID might match an existing SSN.

Note: If you are using at least SQL 2005, you might be able to squeeze some extra performance out of this by marking the computed column PERSISTED.

Solution 3: UDF Check Constraint (Not recommended)

I’m mentioning this for the sake of completeness, but with the caveat that this approach could create considerable performance issues if used on any table with a significant amount of INSERT or UPDATE traffic.

Typically a check constraint is used for simple validation based only on values from the row being inserted or updated, but there is a loophole that allows you to access other rows in the table through the use of a User Defined function in the check constraint.

To implement this approach you would just write a UDF that checks for duplicates in the table for a specified value, excluding of course, the current row being updated.

There might be situations where this is preferable to the computed column constraint method, but I can’t think of one offhand. Use this approach at your own peril.

[Implementation omitted to discourage its use.]


1 CREATE UNIQUE NONCLUSTERED  INDEX [UniqueExceptNulls]
2  ON [Customers] (SSN)
3 WHERE [SSN] IS NOT NULL

The Zen of Certification and the Microsoft Certified Master Program

The Revenant MCP

This week I became Microsoft certified this time by passing the SQL Server 2008 Database Development Exam (70-433) which bestows on me the privilege of calling myself a Microsoft Technology Specialist (MCTS).  I’m assuming this is the new and improved version of the Microsoft Certified Professional (MCP) certification that I remember from the last time I paid attention to these things.

Microsoft Certified XBOX 360 Acheivement

It has been almost a decade since I have devoted more than a passing interest in Redmond’s annoyingly mercurial certification programs. I fondly remember the feelings of validation, accomplishment, and relief upon completing the final exam of the MS Certified Solution Developer (MCSD) track one spring afternoon in 1999. I rode that high for several months before the e-mail arrived from Microsoft giving me the “good news” that I hadn’t finished a race, I had only reached the front of the treadmill.

Dear {Insert Name Here},
We think all of you who busted your hump to get your certification are swimmingly awesome! So awesome, in fact, that we are doing you a huge favor. We are going to make your certification even more marketable to employers by upgrading the program. Well, not exactly YOUR certification, but the one you would have if we hadn’t just put an expiration date on all the tests on your transcript.
Isn’t that just marvy?

The realization that certification was a journey and not a destination was a bit of a buzz-kill for me, to say the least. I started to ruminate on my motivations for becoming certified in the first place and ultimately concluded that it was primarily an effort to credential myself for career advancement.

Is it worth it to continually renew these certifications every time MS incremented the annum on their software?

What is an expired certification worth anyway?

Obsolete

As the tests that comprised my MCSD were gradually retired, and I assume my certification status along with it, I began to wonder about how to document my situation on my résumé.  Sure, it was expired, but it doesn’t it mean anything that I ran this gauntlet, even if I did it in last year’s chariot?

Sure you summited Everest, but what have you done lately?

It seemed only fair to continue to mention the cert on my résumé because, after all, I had earned it. I considered adding a qualifier (expired), but that felt like a stain that screamed “I don’t keep up with technology!”

But that wasn’t true. I was keeping up.  I just couldn’t justify the cost and effort to continually re-take the exams. So I didn’t, until now.

So why now?

The whole reason I am even thinking about MS Certifications again is that my employer realized that they were perilously close losing their MS Partner status and the associated software discounts.

They desperately needed to affiliate themselves with some certified SQL experts to meet the requirements of the program, so I volunteered to give up a few evenings making sure I knew in how to properly  punch myself in the junk using the new features in SQL 2008.

The extra effort to brush up apparently paid off because I scored a respectable 925 (of 1000) on the exam despite the fairly extensive and complex content of the exam that often ventured beyond the features of the platform relevant to developers.

Advice for those taking the 70-433 exam

Microsoft SQL Server 2008 Database Development Training KitThe primary resource I used to prepare for this exam was the MCTS Self-Paced Training Kit (Exam 70-433) from Microsoft Press.

I found the book did a pretty poor job at explaining some of the material and went in to way more detail on some topics than was necessary to prepare for the test. I consulted to various blogs and MSDN articles when I found the explanations from the book too convoluted  or terse to follow (i.e. frequently).

For example, the section introducing Common Table Expression uses an unnecessarily complex query involving multiple joins that made the example too noisy and harder to mentally parse out the syntax of CTEs.

The true value of the book was the framework it provided for studying for the exam than the narrative in the lessons., but the included practice exam, which was way harder than the real one, and the included 15% discount code for the exam registration fee were nice extras.

Some general tips to prepare for this exam:

  • Focus on the new stuff: New features/changes in 2005/2008 are disproportionately represented on the exam.
  • Hands on Practice is critical: Implement at least one example of each concept against a database you are familiar with, not just AdventureWorks.
  • Take the practice exam early: Even if you do horribly, it gives you a list links to MSDN articles for the topics you need the most work on. I wish I had seen it earlier.
  • The practice exam is MUCH harder than the real one: I never broke 65% on the practice exam, but still got 92.5% on the real one.

OCD (Obsessive Completionist Disorder) Takes Root

A few days after passing the exam, I get another of those congratulatory e-mails from Microsoft with a link to the special MCP site where I can see all the perks associated with certification.

That’s all fine until I notice the “Certification Planner” link that informs me that I am one measly test away from the “Microsoft Certified IT Professional” level.MCTS is for schmucks who can’t commit.

This “you are on step 3 of 4” type marketing scheme, is a particular weakness of mine. It is the sole reason I am completely addicted to Mafia Wars, despite the fact that it is a completely pointless game with no action or any real strategy to it.

Mafia Wars

27% ??? That Just won't do!

This is exactly why I’ll probably take that other test regardless of the fact that it really provides very little, if any, additional wow factor to my resume.

Somewhere in the Monk area of my brain, I just know it is a terrible thing for me to be 50% of the way to a MCITP certification for SQL, despite the fact that was perfectly happy at 0%.

What’s the harm anyway? My employer will probably pick up the tab for the exam fees and preparation materials and I always learn something new while preparing for these. Right?

Then I stumbled upon this…

The Microsoft Certified Master (MCM) Program

… crap …

The two tests for the MCITP certification along with 10 years of IT experience and 5 years of SQL experience make you eligible to apply to the mother of all  Microsoft certifications (MOAMC).

Interesting, tell me more…

I need to send a resume and get approved to even try for the certification?
That sounds super exclusive, cool!

A mandatory three week training program on-site in Redmond?
That might be a tough sell for my boss, but it would be cool to get a peek inside MS HQ.
Act now and get a $3,550 discount on the registration fee.
Wait a second, how much is that registration fee exactly?

Registration fee: $18,500.
HRUH-ZAH-WHOOZIT?!?!

Ready to Become A Master?

This MCM ad is a "save the queen" away from being an Evony ad.

Yeah. I don’t think my boss is gonna go for that, even with the helpful ROI calculation they link to in the FAQ. The boss does ROI calculations too and knows the smell of his own BS, and will probably recognize the scent of someone else’s.

So I am gonna be on my own dime and use all my days off if I want to pursue this? Or perhaps I could spend the same money and get an MBA at a mid-level school, which has a much better chance of increasing my earning potential and won’t expire when SQL 2010 is in vogue. It seems kind of like getting a phD in VCR repair.

The comparison to a college degree doesn’t stop there. Tell me this doesn’t sound eerily similar to a Master’s thesis:

The time it takes varies, depending on the candidate. However, the estimated time for fully qualified architects to prepare their documentation and prepare for the Review Board interview process is typically 80 to 120 hours, over a period of three to six months

Who is getting this MCM Certification?

It does appear that this is a pretty exclusive club based Microsoft’s fluff-laden marketing-speak description of the corpus of MCX recipients.

Worldwide, more than 300 Microsoft Certified Masters (MCMs) and Microsoft Certified Architects (MCAs) specialize in specific technologies, and more than 125 specialize in infrastructure and solutions. Those who hold these industry-leading certifications live and work in many countries and regions, including the United States, Europe, Latin America, and Asia Pacific. All have varied backgrounds, interests, and extensive experience.

In fact, on the very same page it appears to list every holder of this certification, which gives some interesting insight into the target candidates for the program.

Just from an eyeball estimate, it appears that MS employees comprise around 80% of the people who have this level of certification. I’m assuming they get a substantial discount beyond the $3,550 special.

For now, I’m gonna have to put this on the “Fat Chance” wish-list, and settle for the MCAD program or something more economical.

What do you think?

Is this certification the least bit enticing to you?
Have you even heard of the MCM or MCA programs, and if not is it worth the money if you have to explain it to a potential employer?