Replacing multiple spaces in a string with a single space in SQL


After reading Jeff Moden’s article “Replace Multiple Spaces with One” describing an approach to collapse white space in a string, I was troubled by the need for a temporary placeholder. I’m generally skeptical of any solution that requires picking a character that doesn’t naturally occur in the data. It just feels like you are building a time-bomb into the app even if you are very careful to pick something so zany it has a very little chance of showing up and causing problems. Also, the character you pick depends on the data you are running this against, so it might not make for a great generic solution.

So today’s project was to find another way to skin that cat without  inserting bogus characters into the data.

Statement of The Problem

For those who haven’t read Jeff’s article, here is a basic statement of the task:

Replace any continuous series of repeating spaces of in a database column with a single space.

Constraints:

  1. Although a CLR UDF using .NET’s regular expression library is the most straightforward way to do this, the original article went for a pure SQL approach so I did the same.
  2. My solution is based language features available in MS SQL Server 2005 and later.

Test Data

I used a permanent table to test this instead of a temporary table, but used the same sample data set from Jeff’s article.  Also, I inserted these same rows 100K times so I had a large enough data set to be useful for performance testing.


CREATE TABLE SpacesTest(
ID int IDENTITY(1,1) NOT NULL,
SomeText varchar(max) NULL,
)

go
DECLARE @loopCounter int
set @test=0

WHILE (@loopCounter<100000)
BEGIN
  INSERT INTO spacestest (sometext)
    SELECT '  This      has multiple   unknown                 spaces in        it.   '
    UNION ALL SELECT 'So                     does                      this!'
    UNION ALL SELECT 'As                                does                        this'
    UNION ALL SELECT 'This, that, and the other  thing.'
    UNION ALL SELECT 'This needs no repair.'
END

Solutions

If we could assume that no series of spaces in the string consisted of more than two spaces, this challenge could be met with a simple application of the REPLACE statement to replace all double spaces with single spaces. However, given series of arbitrary length you need would need to repeat this operation until you had collapsed the longer series down to single spaces.

You could do it iteratively like this:

WHILE EXISTS(SELECT * FROM SpacesTest WHERE (SomeText like '%  %'))
BEGIN
 UPDATE SpacesTest
 SET SomeText=REPLACE(SomeText,'  ',' ')
 WHERE (SomeText like '%  %')
END

While the iterative approach works just fine, it does do of repetitive updates to each row and requires you to write the intermediate output to a temporary table if you just want to query the data rather than updating the source table.  I came up with the following recursive solution that doesn’t require the intermediate updates and is more of  “set based” approach.

Recursive Solution:

WITH spacesCollapsed (ID,SomeText) AS
(
   SELECT ID, cast(LTRIM(RTRIM(SomeText)) as varchar(max)) FROM SpacesTest
   UNION ALL
   SELECT ID, cast(REPLACE(SomeText,'  ',' ') as varchar(max)) as SomeText FROM spacesCollapsed WHERE (SomeText like '%  %')
)
SELECT ID, SomeText
FROM (
   SELECT ID,SomeText,  ROW_NUMBER() OVER(PARTITION BY ID ORDER BY min(len(sometext))) as GroupRowNum
   FROM spacesCollapsed
   GROUP BY ID,SomeText
 ) as reductions
WHERE reductions.GroupRowNum=1;

Conclusions

My new recursive approach did generate valid output and does solve the dilemma of choosing a character that won’t occur in the data. The bad news is that it didn’t perform all that well. In fact, it was over 26x slower than Jeff’s approach. The iterative approach performed a lot better, but was still almost 6x slower than the benchmark. I suspect the performance problems with both of my alternatives come from the need to make multiple passes at the source data using a non-SARGable expression in the where clause.

Query Stats for Jeff's Approach

Jeff's Approach (for comparison)

Query Stats for Recursive Approach

Recursive Approach (Ugh!)

So, although it was a fun exercise, I have to concede this one. If performance is a major consideration (and when isn’t it?) then Jeff actually has a superior technique. Here is an example query using that technique, just be careful to understand the potential side-effects if you use it.

 SELECT REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(sometext)),'  ',' '+ CHAR(7)) , CHAR(7)+' ',''), CHAR(7),'') AS CleanString
 FROM SpacesTest
 WHERE CHARINDEX('  ',sometext) > 0

Feedback

Given how dramatic the difference is for a similar operation, I have to think that maybe I am missing an optimization on my recursive solution that could reign in the performance. If you see anything I’m missing that might help my solution in terms of performance please let me know in the comments.

About these ads

One Response

  1. No, you’re not missing a thing. Recursion provides absolutely horrible performance especially in Recursive CTE’s. Please see the following article for why.

    http://www.sqlservercentral.com/articles/T-SQL/74118/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 132 other followers

%d bloggers like this: