How to diff two folders from a Windows command prompt

In most cases where I need to compare two folders recursively on a Windows system I use my go-to tool Beyond Compare. It is an excellent utility, and one that I think should be among the first utilities any developer should install on a new machine.

However, today I was doing a reconciliation as part of a very large file migration project that required comparing two folders that each contained hundreds of millions of files spread across thousands of sub-folders. BC was having a lot of trouble and choked on many of my comparisons. It just wasn’t the tool for today’s job. I needed another solution.

Necessity mothered some invention and I found an inventive way to use a combination of command switches on  RoboCopy to perform the comparison. If you are not familiar with RoboCopy, and you do a lot of mass copying of files, you need to stop what you are doing and learn about it pronto. It is a supercharged version of XCopy that has been included with Windows since Vista. It has a ton of great features such as multi-threaded file copying, selectively copying changed files, and resumable copies that make it a must especially for big file copy jobs over flaky network connections.

Diff Command Using RoboCopy

So here’s the command to perform a basic comparison of two folders and write a log file listing the differences.

ROBOCOPY “\\FileShare\SourceFolder” “\\FileShare\ComparisonFolder” /e /l /ns /njs /njh /ndl /fp /log:reconcile.txt

Explanation of the command switches used above:

/e  Recurse through sub-directories (including empty ones)
/l  Don’t modify or copy files, log differences only
/fp  Include the full path of files in log (only necessary if you omit /ndl)
/ns  Don’t include file sizes in log
/ndl  Don’t include folders in log
/njs   Don’t include Job Summary
/njh   Don’t include Job Header
/log:reconcile.txt   Write log to reconcile.txt (Recreate if exists)
/log+: reconcile.txt   (Optional variant) Write log to reconcile.txt (Append if exists)

Usage Notes and Warnings Regarding the /NDL Option

The /NDL option is a handy way to suppress the inclusion of every folder checked (regardless of whether it contains differences) in the log, but there because of the way it works it is not a good idea in all circumstances. Consider the following before you use /NDL.

  • Folders that exist only on source or destination are not logged unless at least one mismatched file is present or a source file is missing on destination.
  • Folders that exist only on the destination are not logged at all regardless of contents.

If you omit the /NDL option, it is necessary to include the /FP option if you want full paths listed for each file.

Example Output

(with /NDL option)

*EXTRA File         c:\dest\log.txt
New File              c:\source\newfolder\Blah.txt
Newer                 c:\source\Files\CONCORD.DAT
New File              c:\source\Files\COWCO.DAT

(without NDL Option)

 c:\work\test\source\    (extraneous folder listing)
*EXTRA Dir      c:\dest\newfolderdest\
*EXTRA Dir      c:\dest\newfolderrestempty\
*EXTRA File     c:\dest\log.txt
New Dir           c:\source\newfolder\
New File          c:\source\newfolder\Blah.txt
New Dir           c:\source\newfolderempty\
c:\source\Files\   (extraneous folder listing)
Newer             c:\source\Files\CONCORD.DAT
New File          c:\source\Files\COWCO.DAT
c:\test\source\FilesSame\   (Included despite no diffs)

5 Best Practices for Commenting Your Code

One of the first things you learn to do incorrectly as a programmer is commenting your code. My experience with student and recently graduated programmers tells me that college is a really good place to learn really bad code commenting techniques. This is just one of those areas where in-theory and in-practice don’t align well.

There are two factors working against you learning good commenting technique in college.

  1.  Unlike the real world, you do a lot of small one-off projects as a solo developer.  There’s no one out there fantasizing about dropping a boulder on you for making them decipher your coding atrocity.
  2.  That commenting style you are emulating from your textbook is only a good practice when the comments are intended for a student learning to program. It is downright annoying to professional programmers.

These tips are primarily intended for upstart programmers who are transitioning into the real world of programming, and hopefully will prevent a few from looking quite so n00bish during their first code review. Code Review? Oh yeah, that’s something else they didn’t teach you in school, but that’s a whole other article, I’ll defer to Jason Cohen on that one.

So let’s get started…

(1) Comments are not subtitles

It’s easy to project your own worldview that code is a foreign language understood only by computers, and that you are doing the reader a service by explaining what each line does in some form of human language. Or perhaps you are doing it for the benefit of that non-programmer manager who will certainly want to read your code (Spoiler: He won’t).

Look, in the not too distant future, you will be able to read code almost as easily as your native language, and everyone else who will even glance at it almost certainly already can. By then you will realize how silly it is to write comments like these:

// Loop through all bananas in the bunch
foreach(banana b in bunch) {
    monkey.eat(b);  //make the monkey eat one banana
}

You may have been taught to program by first writing  pseudo-code comments then writing the real code into that wire-frame. This is a perfectly reasonable approach for a novice programmer. Just be sure to replace the comments with the code, and don’t leave them in there.

Computer: Enemy is matching velocity.
Gwen DeMarco: The enemy is matching velocity!
Sir Alexander Dane: We heard it the first time!
Gwen DeMarco: Gosh, I’m doing it. I’m repeating the darn computer!

-Galaxy Quest

Exceptions:

  • Code examples used to teach a concept or new programming language.
  • Programming languages that aren’t remotely human readable (Assembly, Perl)

(2) Comments are not an art project

This is a bad habit propagated by code samples in programing books and open source copyright notices that are desperate to make you pay attention to them.

/*
   _     _      _     _      _     _      _     _      _     _      _     _
  (c).-.(c)    (c).-.(c)    (c).-.(c)    (c).-.(c)    (c).-.(c)    (c).-.(c)
   / ._. \      / ._. \      / ._. \      / ._. \      / ._. \      / ._. \
 __\( Y )/__  __\( Y )/__  __\( Y )/__  __\( Y )/__  __\( Y )/__  __\( Y )/__
(_.-/'-'\-._)(_.-/'-'\-._)(_.-/'-'\-._)(_.-/'-'\-._)(_.-/'-'\-._)(_.-/'-'\-._)
   || M ||      || O ||      || N ||      || K ||      || E ||      || Y ||
 _.' `-' '._  _.' `-' '._  _.' `-' '._  _.' `-' '._  _.' `-' '._  _.' `-' '._
(.-./`-'\.-.)(.-./`-'\.-.)(.-./`-'\.-.)(.-./`-'\.-.)(.-./`-'\.-.)(.-./`-'\.-.)
 `-'     `-'  `-'     `-'  `-'     `-'  `-'     `-'  `-'     `-'  `-'     `-'

                 -It's Monkey Business Time! (Version 1.5)
*/

Why, that’s silly. You’d never do something so silly in your comments.

ORLY? Does this look familiar?

   +------------------------------------------------------------+
   | Module Name: classMonkey                                   |
   | Module Purpose: emulate a monkey                           |
   | Inputs: Bananas                                              |
   | Outputs: Grunts                                            |
   | Throws: Poop                                               |
   +------------------------------------------------------------+

Programmers love to go “touch up” their code to make it look good when their brain hurts and they want something easy to do for a while. It may be a waste of time, but at least they are wasting it during periods where they wouldn’t be productive anyway.

The trouble is that it creates a time-wasting maintenance tax imposed on anyone working with the code in the future just to keep the pretty little box intact when the text ruins the symmetry of it. Even programmers who hate these header blocks tend to take the time to maintain them because they like consistency and every other method in the project has one.

How much is it bugging you that the right border on that block is misaligned? Yeah. That’s the point.

(3) Header Blocks: Nuisance or Menace?

This one is going to be controversial, but I’m holding my ground. I don’t like blocks of header comments at the top of every file, method or class.

Not in a boat, not with a goat.
Why? Well let me tell you, George McFly…

They are enablers for badly named objects/methods – Of course, header blocks aren’t the cause for badly named identifiers, but they are an easy excuse to not  put in the work to come up with meaningful names, an often deceptively difficult task. It provides too much slack to just assume the consumer can just read the “inline documentation” to solve the mystery of what the DoTheMonkeyThing method is all about.

JohnFx’s Commandment:
The consumer of thy code should never have to see its source code to use it, not even the comments.

They never get updated: We all know that methods are supposed to remain short and sweet, but real life gets in the way and before you know it you have a 4K line class and the header block is scrolled off of the screen in the IDE 83% of the time. Out of sight, out of mind, never updated.

The bad news is that they are usually out of date. The good news is that people rarely read them so the opportunity for confusion is mitigated somewhat. Either way, why waste your time on something that is more likely to hurt than help?

JohnFx’s Maxim of Plagiarized Ideas :
Bad Documentation is worse than no documentation.

Exception: Some languages (Java/C#) have tools that can digest specially formatted header block comments into documentation or Intellisense/Autocomplete hints. Still, remember rule (2) and stick to the minimum required by the tool and draw the line at creating any form of ASCII art.

(4) Comments are not source control

This issue is so common that I have to assume that programmers (a) don’t know how to use source control; or  (b) don’t trust it.

Archetype 1: “The Historian”

     // method name: pityTheFoo (included for the sake of redundancy)
     // created: Feb 18, 2009 11:33PM
     // Author: Bob
     // Revisions: Sue (2/19/2009) - Lengthened monkey's arms
     //            Bob (2/20/2009) - Solved drooling issue

     void pityTheFoo() {
          ...
     }

The programmers involved in the evolution of this method probably checked this code into a source control system designed to track the change history of every file, but decided to clutter up the code anyway. These same programmers more than likely always leave the Check-In Comments box empty on their commits.

I think I hate this type of comment worst of all, because it imposes a duty on other programmers to keep up the tradition of duplicating effort and wasting time maintaining this chaff. I almost always delete this mess from any code I touch without an ounce of guilt. I encourage you to do the same.

Archetype 2: “The Code Hoarder”


     void monkeyShines() {
          if (monkeysOnBed(Jumping).count > max) {
             monkeysOnBed.breakHead();

             // code removed, smoothie shop closed.
             // leaving it in case a new one opens.
             // monkeysOnBed.Drink(BananaSmoothie);
          }
     }

Another feature of any tool that has any right to call itself a SCM is the ability to recover old versions of code, including the parts you removed. If you want to be triple super extra sure, create a branch to help you with your trust issues.

(5) Comments are a code smell

Comments are little signposts in your code explaining it to future archaeologists that desperately need to understand how 21st century man sorted lists of purchase orders.

Unfortunately, as Donald Norman explained so brilliantly in The Design of Everyday Things, things generally need signs because their affordances have failed. In plain English, when you add a comment you are admitting that you have written code that doesn’t communicate its purpose well.

Sign:”This is a mop sink.” Why would that be necces… oh.

Despite what your prof told you in college, a high comment to code ratio is not a good thing.  I’m not saying to avoid them completely, but if you have a 1-1 or even a 5-1 ratio of LOC to comments, you are probably overdoing it. The need for excessive comments is a good indicator that your code needs refactoring.

Whenever you think, “This code needs a comment” follow that thought with, “How could I modify the code so its purpose is obvious?”
Talk with your code, not your comments.

Technique 1: Use meaningful identifiers and constants (even if they are single use)

     // Before
     // Calculate monkey's arm length
     // using its height and the magic monkey arm ratio
     double length = h * 1.845; //magic numbers are EVIL!

    // After - No comment required
    double armLength = height * MONKEY_ARM_HEIGHT_RATIO;

Technique 2: Use strongly typed input and output parameters

      // Before
      // input parameter monkeysToFeed:
      // DataSet with one table containing two columns
      //     MonkeyID (int) the monkey to feed
      //     MonkeyDiet (string) that monkey's diet
      void feedMonkeys(DataSet monkeysToFeed) {
      }

     //  After: No comment required
     void feedMonkeys(Monkeys monkeysToFeed) {
     }

Technique 3: Extract commented blocks into another method

      // Before

      // Begin: flip the "hairy" bit on monkeys
      foreach(monkey m in theseMonkeys) {
          // 5-6 steps to flip bit.
      }
      // End: flip the "hairy" bit on monkeys

     // After No comment required
     flipHairyBit(theseMonkeys);

As an added bonus, technique 3 will tend to reduce the size of your methods and minimizing the nesting depth (see also “Flattening Arrow Code”) all of which contribute to eliminating the need for commenting the closing tags of blocks like this:

            } // ... if see evil
         } // ... while monkey do.
      } // ... if monkey see.
    } // ... class monkey
  } // ... namespace primate

Acknowledgments

Several of the ideas presented here, and a good deal of the fundamental things I know about programming as part of a team, I learned from the book Code Complete by Steve McConnell. If you are a working programmer and have not read this book yet, stop what you are doing and read it before you write another line of code.

Quick Tip: Comparing a .NET string to multiple values

The Problem

Do you ever wish you could use the SQL IN operator in your C# code to make your conditional blocks more concise and your code easier to read?

Perhaps it’s just my persnickety nature, but I believe that line-wrapped conditional expressions like this are a code smell.

if (animal.Equals("Cow") ||
   animal.Equals("Horse") ||
   animal.Equals("Hen"))
{
   Console.WriteLine("We must be on the farm.");
}

This would be so much cleaner…

if (animal.CompareMultiple("Cow","Horse","Hen")
{
   Console.WriteLine("We must be on the farm.");
}

The Code

With a simple extension class you can upgrade your string classes to do this very thing.

Step 1: Create an extension class as demonstrated here.

C#

namespace extenders.strings
{
  public static class StringExtender {

    public static bool CompareMultiple(this string data, StringComparison compareType, params string[] compareValues) {
      foreach (string s in compareValues) {
        if (data.Equals(s, compareType)) {
          return true;
        }
      }
      return false;
   }
  }
}

VB.NET

Imports System.Runtime.CompilerServices

Namespace Extenders.Strings

  Public Module StringExtender

    <Extension()> _
    Public Function CompareMultiple(ByVal this As String, compareType As StringComparison, ParamArray compareValues As String()) As Boolean

       Dim s As String

       For Each s In compareValues
         If (this.Equals(s, compareType)) Then
            Return True
         End If
       Next s

       Return False
    End Function

  End Module

End Namespace

Step 2: Add a reference to the extension namespace and use it.

C#

using extenders.strings;

namespace MyProgram
{
  static class program {
    static void Main() {

      string foodItem = "Bacon";

      if (foodItem.CompareMultiple(StringComparison.CurrentCultureIgnoreCase, "bacon", "eggs", "biscuit")) {
         System.Console.WriteLine("Breakfast!");
      }
      else {
         System.Console.Write("Dinner");
      }

    }
  }
}

VB.NET

Imports StringExtenderExampleVB.Extenders.Strings

Module Program

  Sub Main()
    Dim foodItem As String = "Bacon"

    If (foodItem.CompareMultiple(System.StringComparison.CurrentCultureIgnoreCase, "bacon", "eggs", "biscuit")) Then
      System.Console.WriteLine("Breakfast!")
    Else
      System.Console.Write("Dinner")
    End If

    System.Console.ReadLine()
  End Sub

End Module

Replacing multiple spaces in a string with a single space in SQL

After reading Jeff Moden’s article “Replace Multiple Spaces with One” describing an approach to collapse white space in a string, I was troubled by the need for a temporary placeholder. I’m generally skeptical of any solution that requires picking a character that doesn’t naturally occur in the data. It just feels like you are building a time-bomb into the app even if you are very careful to pick something so zany it has a very little chance of showing up and causing problems. Also, the character you pick depends on the data you are running this against, so it might not make for a great generic solution.

So today’s project was to find another way to skin that cat without  inserting bogus characters into the data.

Statement of The Problem

For those who haven’t read Jeff’s article, here is a basic statement of the task:

Replace any continuous series of repeating spaces of in a database column with a single space.

Constraints:

  1. Although a CLR UDF using .NET’s regular expression library is the most straightforward way to do this, the original article went for a pure SQL approach so I did the same.
  2. My solution is based language features available in MS SQL Server 2005 and later.

Test Data

I used a permanent table to test this instead of a temporary table, but used the same sample data set from Jeff’s article.  Also, I inserted these same rows 100K times so I had a large enough data set to be useful for performance testing.


CREATE TABLE SpacesTest(
ID int IDENTITY(1,1) NOT NULL,
SomeText varchar(max) NULL,
)

go
DECLARE @loopCounter int
set @test=0

WHILE (@loopCounter<100000)
BEGIN
  INSERT INTO spacestest (sometext)
    SELECT '  This      has multiple   unknown                 spaces in        it.   '
    UNION ALL SELECT 'So                     does                      this!'
    UNION ALL SELECT 'As                                does                        this'
    UNION ALL SELECT 'This, that, and the other  thing.'
    UNION ALL SELECT 'This needs no repair.'
END

Solutions

If we could assume that no series of spaces in the string consisted of more than two spaces, this challenge could be met with a simple application of the REPLACE statement to replace all double spaces with single spaces. However, given series of arbitrary length you need would need to repeat this operation until you had collapsed the longer series down to single spaces.

You could do it iteratively like this:

WHILE EXISTS(SELECT * FROM SpacesTest WHERE (SomeText like '%  %'))
BEGIN
 UPDATE SpacesTest
 SET SomeText=REPLACE(SomeText,'  ',' ')
 WHERE (SomeText like '%  %')
END

While the iterative approach works just fine, it does do of repetitive updates to each row and requires you to write the intermediate output to a temporary table if you just want to query the data rather than updating the source table.  I came up with the following recursive solution that doesn’t require the intermediate updates and is more of  “set based” approach.

Recursive Solution:

WITH spacesCollapsed (ID,SomeText) AS
(
   SELECT ID, cast(LTRIM(RTRIM(SomeText)) as varchar(max)) FROM SpacesTest
   UNION ALL
   SELECT ID, cast(REPLACE(SomeText,'  ',' ') as varchar(max)) as SomeText FROM spacesCollapsed WHERE (SomeText like '%  %')
)
SELECT ID, SomeText
FROM (
   SELECT ID,SomeText,  ROW_NUMBER() OVER(PARTITION BY ID ORDER BY min(len(sometext))) as GroupRowNum
   FROM spacesCollapsed
   GROUP BY ID,SomeText
 ) as reductions
WHERE reductions.GroupRowNum=1;

Conclusions

My new recursive approach did generate valid output and does solve the dilemma of choosing a character that won’t occur in the data. The bad news is that it didn’t perform all that well. In fact, it was over 26x slower than Jeff’s approach. The iterative approach performed a lot better, but was still almost 6x slower than the benchmark. I suspect the performance problems with both of my alternatives come from the need to make multiple passes at the source data using a non-SARGable expression in the where clause.

Query Stats for Jeff's Approach

Jeff's Approach (for comparison)

Query Stats for Recursive Approach

Recursive Approach (Ugh!)

So, although it was a fun exercise, I have to concede this one. If performance is a major consideration (and when isn’t it?) then Jeff actually has a superior technique. Here is an example query using that technique, just be careful to understand the potential side-effects if you use it.

 SELECT REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(sometext)),'  ',' '+ CHAR(7)) , CHAR(7)+' ',''), CHAR(7),'') AS CleanString
 FROM SpacesTest
 WHERE CHARINDEX('  ',sometext) > 0

Feedback

Given how dramatic the difference is for a similar operation, I have to think that maybe I am missing an optimization on my recursive solution that could reign in the performance. If you see anything I’m missing that might help my solution in terms of performance please let me know in the comments.

How to join on memo fields in Microsoft Access

Rambling Intro, Nostalgia, and Crankiness

This week I got a request troubleshoot a legacy Microsoft Access application that has been floating around our company for ages, but still gets used daily because dang it, it does the job and always has. Seems like most companies that are standardized on MS Office have a few of these lurking out on the network.

Earlier in my career I did a ton of work in MS Access and have garnered a reputation within my company for being an expert in this oft maligned platform so I got the call to look into the problem. It had been quite a while since I’d done any real work on MS Access and I’d forgotten about how quirky it could be. Also, I am more than a little disappointed at how Microsoft has mangled the UI of my old friend Access in the 2007 version. It is almost painful to work with it as a power-user in the current incarnation.

The Problem

So anyway, the issue turned out to be that someone increased the length of a field in the underlying SQL Server table linked into the Access application. They increased it past the magical border (255 characters) between what Access considers a text and a memo field, which imposed new limits on how it could be used. In particular, Access doesn’t allow either end of a join in a query to be a memo field.

Can't join on memo fields

This won't fly, McFly

The Solution

The solution is painfully simple. So much so that I have to wonder why Access doesn’t just do it behind the scenes. Perhaps it is just trying to discourage you from building databases that link on big text fields for your own good (see “The Caveat” below).

The trick is to move the join into the WHERE clause of the query  like so:

SELECT Table1.*, Table2.* FROM Table1, Table2 WHERE (Table1.MemoField=table2.MemoField);

Here is the same query in the query builder for those of you who prefer it to the SQL view:

Graphical display of query

Remove the join between the tables and add a criterion

Access will raise nary a complaint if you run this query which is logically equivalent to the one it abhorred. That’s all there is to it.

The Caveat

A final note. It is a definite database smell for an application to be joining tables on long text fields and will likely be the source of some performance issues in a database of non-trivial size. However, as was the case for the application I was tweaking, joining on long text fields is sometimes necessary in queries used for data clean-up, validation, or replication.  Still, use this type of join with caution avoiding it whenever possible.

Usability Hall of Shame – AutoCorrect

Cupertino, we have a problem.

Q: How do you know when your software has a usability problem?

A: When someone creates an entire humor site dedicated to mocking it.

Okay, you might have said something involving reports from the usability testing, or maybe even some axiomatic list of UX smells. However, if you ever find that a piece of software you wrote has inspired an entire new category of parodic humor, there just might be a problem.

How has it come to this?

The emergence of this UX gaffe leaves me incredulous for a number of reasons. First, one of the biggest offending implementations is baked into iOS, a product of a company that gets a lot of respect for designing great interfaces. Secondly, and more importantly, most of the Mobile OS platforms with this issue have gone through several major upgrades and not fixed the problem despite all the bad press about it.

Fail x 439,000

What’s the Problem?

So to reel this conversation back in, and make it instructional and not just another ranting voice on the Internet that the developers of this software will ignore, let’s talk about what it is about the auto-correct feature on your cell phone that makes it such a usability disaster.

The core problem is that it violates the usability principle that the user should always feel like they are in control of the software.

The users of your software should be the drivers, not the passengers. Ask Toyota how their ‘users’ felt about their products deciding independently that more speed might be nice .

If you are going to make your spelling checker jump in and autonomously decide it knows what the typist meant to say, it had better be right 100% of the time.  Is that bar too high? Well then stop making your app take over the controls. A well behaved app is heard and not stomping all over my carefully written prose and/or review of the wing place where I am currently receiving inferior service.

Not only does this implementation of auto-correct make the user feel impotent, it also violates the usability commandment that the user’s input is sacred. Violations of this principle contribute mightily to the computer-phobic person’s notion that the machine is going all SkyNet on them.

Extra Credit: Why do you think applications like MS Office that move your menu items around by usage frequency so infuriating? How about browser pop-ups?

Reinventing the Wheel – Let’s try square this time!

This would all be more excusable if a perfectly user-friendly auto-correct interface didn’t already exist. What’s more aggravating is that the existing workable system is so close to the new broken version that it is inconceivable that the designers of the new version weren’t familiar with the existing one.

Consider for a moment how auto-correct is implemented in MS Office, Blackberry OS, and the Firefox spell-checker. In almost identical fashion it gives you a list of suggested alternatives for the suspected misspelled word.  The key difference is that it SUGGESTS, but doesn’t CORRECT the user’s text unless the user takes affirmative action. This puts the user fully in control of the system, which makes them happy. Sure some typos may slip by the user, but if you are giving good enough visual indicators that should be minimized.

If only auto-correct had an anti-troll feature...

Is Anyone Listening?

Perhaps I’ve completely missed the point here, and there is a good reason why something that was routine on my old Blackberry is nigh impossible on the other smart phone platforms. If you have any inside scoop, or just have a convincing theory to share I’d love to hear about it in the comments. If you just want to complain about it, that’s fine too, maybe we can make this square wheel squeaky enough to get some satisfaction.

Concatenating Strings – You’re still doing it wrong.

A common anti-pattern I see from novice programmers is the tendency to read a coding tip somewhere, assume it to be a universal truth, and immediately start applying it everywhere in their code without fully understanding it. Usually the coding tip relates to optimization and is interpreted by the coder as “X is faster than Y, so always do Y instead of X.” This fallacy is particularly rampant with respect to the different approaches for concatenating strings.

Car with Wing

But it works on race cars....

I’m not writing this article to chastise the fledgling programmer who has fallen into this trap, nor is this intended as a how-to article on optimizing your code. Heaven knows there the internet is lousy with articles about the most efficient way to cram strings together. I will address the problems associated with some of the mythology about string concatenation, but my primary goal will be to encourage critical thinking and healthy skepticism for silver bullet programming techniques.

The Symptom

Though I possess no paranormal super-powers, I do believe I can read the mind of another human when I see code that looks like this:

    StringBuilder WelcomeMessage = new StringBuilder();
    WelcomeMessage.Append("Hello ");
    WelcomeMessage.Append(firstName);
    WelcomeMessage.Append(" ");
    WelcomeMessage.Append(LastName);
    WelcomeMessage.Append("!\n")

My spirit guide informs me that that programmer responsible for this code remembers reading somewhere that <insert programming language> is really inefficient at concatenating strings, but you can overcome that limitation by using the StringBuilder class. Based on this information he/she replaces every string concatenation operator with this clever technique leaving a scent trail through the code that experienced programmers can smell from miles away.

Problem 1: Premature Optimization is Procrastination

Sure, you want ALL of your code to perform well, but experienced programmers understand that their time is valuable and best spent on activities that deliver actual business value. Premature optimization and its ugly cousin Micro Optimization are almost always a waste of time. I understand how tempting it is to justify in your head that you could squeeze a little bit more performance out of your app by re-factoring the whole thing you learned in the blog post you read today, especially since it can be like a mini-vacation for your mind from the really complex issues you really should be working on, but be strong and resist!

As a rule of thumb: If it isn’t worth creating a jig to profile the performance gains you expect to get from optimization re-factoring, then it isn’t worth the time to do the re-factoring in the first place, plus it is risky because you won’t notice that your supposed “optimization” actually hurt performance.

More on that later.

Problem 2: Cookie Cutter Optimizations Assume the Compiler is Stupid

If you could universally make string concatenation faster by applying a simple formula then the compiler would probably already be applying the transformation anyway. Granted, I think this point is lost on some novice programmers who only have experience in higher level languages.

For them, I’ll clarify with this point.

Your program is running the compiler’s interpretation of your code. Not your actual code.

With that in mind, to think that the StringBuilder approach always runs faster would require you to believe that the people who wrote the programming language were smart enough to make string concatenation fast when they created the StringBuilder class, but forgot how to do it when they built the concatenation operator.

Pop Quiz: Does this code give you any heartburn? Why?

    string querySQL = "SELECT * " +
                      "FROM myTable " +
                      "WHERE (ID=5)";

If you said yes because it isn’t worth incurring the cost of concatenation for code readability then you aren’t giving the compiler enough credit.

Here is the MSIL output for the above statement:

ldstr      "SELECT * FROM myTable WHERE (ID=5)"

Amazing, huh?

Compilers are written to do the complex task of reading your code and interpreting what it means. Figuring out that a series of constant strings can be combined is child’s play.

Problem 3: If you don’t understand it, you’ll do it wrong

Cargo Cult programming is a derisive term for doing things in your program because you think you need to, but don’t understand (or have a vague notion of)  the underlying reason. It is really bad practice to adopt a technique without asking enough “why” questions to grasp the reason using it is desirable.

As an example, let’s dissect the premise that string concatenation using operators is slow and should be replaced by StringBuilders.

Q: Why do some claim that string concatenation with operators is slow?
In many garbage collected languages (Java/.NET) string objects are  immutable, meaning you can’t change them. So when you append more content into an existing string the program must internally create a new string and copy the old and new contents into it. The extra effort to create, destroy and garbage collect the extra string objects has the potential to create more work for your program and can degrade performance if done excessively.

Q: How does the StringBuilder help?
The StringBuilder class is implemented as a mutable memory buffer that typically has extra unused space allocated so that concatenations can be made in place without the need to create extra objects to juggle the data.

Q: How much extra space does it reserve? What happens if I append more content than will fit in the unused space?
By default (in .NET) 16 characters, unless you specify differently in the constructor. If you append more data than there is space, the StringBuilder will behave much like a String object creating a new StringBuilder object with double the existing capacity then copy over the data.

You: Wait, what?

You mean that you have been using StringBuilder with the default constructor and then appending more than 16 characters to it?

Yeah, well if you are lucky you’ll be no worse off than if you just used the “evil” string concatenation operators. However, due to that neat capacity doubling side-effect, your program might actually be locking up unnecessarily large chunks of memory on top of the additional work required to wrangle all the intermediary objects. Perhaps it is worth investigating and setting the initial capacity of that StringBuilder to avoid such nastiness.

Bonus: Now that you understand the potential performance benefit is based (at least partly) on mutability, you will see that other string optimization opportunities may exist whenever existing strings need to be modified, not just appended to.

Final Thoughts

Again, the point of all this isn’t about strings, or optimization, or any of that. It is about taking the time to understand what you are doing to avoid falling prey to the potentially harmful myths that are enthusiastically passed around by programmers (see also “The database sorts by the clustered index if you don’t specify an Ordering“).

In any event, I’m curious as to how many of my readers actually have at one time subscribed to the cargo cult programing meme of “StringBuilder is always better”. Please let me know in the comments.