After reading Jeff Moden’s article “Replace Multiple Spaces with One” describing an approach to collapse white space in a string, I was troubled by the need for a temporary placeholder. I’m generally skeptical of any solution that requires picking a character that doesn’t naturally occur in the data. It just feels like you are building a time-bomb into the app even if you are very careful to pick something so zany it has a very little chance of showing up and causing problems. Also, the character you pick depends on the data you are running this against, so it might not make for a great generic solution.
So today’s project was to find another way to skin that cat without inserting bogus characters into the data.
Statement of The Problem
For those who haven’t read Jeff’s article, here is a basic statement of the task:
Replace any continuous series of repeating spaces of in a database column with a single space.
- Although a CLR UDF using .NET’s regular expression library is the most straightforward way to do this, the original article went for a pure SQL approach so I did the same.
- My solution is based language features available in MS SQL Server 2005 and later.
I used a permanent table to test this instead of a temporary table, but used the same sample data set from Jeff’s article. Also, I inserted these same rows 100K times so I had a large enough data set to be useful for performance testing.
CREATE TABLE SpacesTest( ID int IDENTITY(1,1) NOT NULL, SomeText varchar(max) NULL, ) go DECLARE @loopCounter int set @test=0 WHILE (@loopCounter<100000) BEGIN INSERT INTO spacestest (sometext) SELECT ' This has multiple unknown spaces in it. ' UNION ALL SELECT 'So does this!' UNION ALL SELECT 'As does this' UNION ALL SELECT 'This, that, and the other thing.' UNION ALL SELECT 'This needs no repair.' END
If we could assume that no series of spaces in the string consisted of more than two spaces, this challenge could be met with a simple application of the REPLACE statement to replace all double spaces with single spaces. However, given series of arbitrary length you need would need to repeat this operation until you had collapsed the longer series down to single spaces.
You could do it iteratively like this:
WHILE EXISTS(SELECT * FROM SpacesTest WHERE (SomeText like '% %')) BEGIN UPDATE SpacesTest SET SomeText=REPLACE(SomeText,' ',' ') WHERE (SomeText like '% %') END
While the iterative approach works just fine, it does do of repetitive updates to each row and requires you to write the intermediate output to a temporary table if you just want to query the data rather than updating the source table. I came up with the following recursive solution that doesn’t require the intermediate updates and is more of “set based” approach.
WITH spacesCollapsed (ID,SomeText) AS ( SELECT ID, cast(LTRIM(RTRIM(SomeText)) as varchar(max)) FROM SpacesTest UNION ALL SELECT ID, cast(REPLACE(SomeText,' ',' ') as varchar(max)) as SomeText FROM spacesCollapsed WHERE (SomeText like '% %') ) SELECT ID, SomeText FROM ( SELECT ID,SomeText, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY min(len(sometext))) as GroupRowNum FROM spacesCollapsed GROUP BY ID,SomeText ) as reductions WHERE reductions.GroupRowNum=1;
My new recursive approach did generate valid output and does solve the dilemma of choosing a character that won’t occur in the data. The bad news is that it didn’t perform all that well. In fact, it was over 26x slower than Jeff’s approach. The iterative approach performed a lot better, but was still almost 6x slower than the benchmark. I suspect the performance problems with both of my alternatives come from the need to make multiple passes at the source data using a non-SARGable expression in the where clause.
So, although it was a fun exercise, I have to concede this one. If performance is a major consideration (and when isn’t it?) then Jeff actually has a superior technique. Here is an example query using that technique, just be careful to understand the potential side-effects if you use it.
SELECT REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(sometext)),' ',' '+ CHAR(7)) , CHAR(7)+' ',''), CHAR(7),'') AS CleanString FROM SpacesTest WHERE CHARINDEX(' ',sometext) > 0
Given how dramatic the difference is for a similar operation, I have to think that maybe I am missing an optimization on my recursive solution that could reign in the performance. If you see anything I’m missing that might help my solution in terms of performance please let me know in the comments.