I’ve been tinkering with small functions on my own time, trying to find ways to refactor them (I recently read Martin Fowler’s book Refactoring: Improving the Design of Existing Code). I found the following function MakeNiceString() while updating another part of the codebase near it, and it looked like a good candidate to mess with. As it is, there’s no real reason to replace it, but it’s small and does one simple thing, so it’s easy to follow and still makes for a ‘good’ refactoring exercise.
    // requires: using System.Text.RegularExpressions;

    // Original: inserts a space before every character that is not lower case.
    private static string MakeNiceString(string str)
    {
        char[] ca = str.ToCharArray();
        string result = null;
        int i = 0;
        result += System.Convert.ToString(ca[0]);
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                result += ' ';
            }
            result += System.Convert.ToString(ca[i]);
        }
        return result;
    }

    // Regex version: splits before each upper-case letter, except at the start.
    static string SplitCamelCase(string str)
    {
        string[] temp = Regex.Split(str, @"(?<!^)(?=[A-Z])");
        string result = String.Join(" ", temp);
        return result;
    }
The first function MakeNiceString() is the function I found in some code I was updating at work. The purpose of the function is to translate ThisIsAString to This Is A String. It’s used in a half-dozen places in the code, and is pretty insignificant in the whole scheme of things.
I built the second function purely as an academic exercise to see if using a regular expression would take longer or not.
Well, here are the results:
With 10 Iterations:
    MakeNiceString took 2649 ticks
    SplitCamelCase took 2502 ticks

However, it changes drastically over the long haul:
With 10,000 Iterations:
    MakeNiceString took 121625 ticks
    SplitCamelCase took 443001 ticks
Refactoring MakeNiceString()
The process of refactoring MakeNiceString() started with simply removing the conversions that were taking place. Doing that yielded the following results:

    MakeNiceString took 124716 ticks
    ImprovedMakeNiceString took 118486
Here’s the code after Refactor #1:
    private static string ImprovedMakeNiceString(string str)
    {
        //Removed Convert.ToString()
        char[] ca = str.ToCharArray();
        string result = null;
        int i = 0;
        result += ca[0];
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                result += ' ';
            }
            result += ca[i];
        }
        return result;
    }
Refactor #2 – Use StringBuilder
My second task was to use StringBuilder instead of String. Since String is immutable, unnecessary copies were being created throughout the loop. The benchmark for using that is below, as is the code:
    // requires: using System.Text;
    static string RefactoredMakeNiceString(string str)
    {
        char[] ca = str.ToCharArray();
        // Initial capacity leaves room for roughly one extra space per four characters.
        StringBuilder sb = new StringBuilder((str.Length * 5 / 4));
        int i = 0;
        sb.Append(ca[0]);
        for (i = 1; i <= ca.Length - 1; i++)
        {
            if (!(char.IsLower(ca[i])))
            {
                sb.Append(' ');
            }
            sb.Append(ca[i]);
        }
        return sb.ToString();
    }

This results in the following benchmark:

    MakeNiceString Took: 124497 Ticks            //Original
    SplitCamelCase Took: 464459 Ticks            //Regex
    ImprovedMakeNiceString Took: 117369 Ticks    //Remove Conversion
    RefactoredMakeNiceString Took: 38542 Ticks   //Using StringBuilder
Changing the for loop to a foreach loop resulted in the following benchmark result:
    static string RefactoredForEachMakeNiceString(string str)
    {
        char[] ca = str.ToCharArray();
        StringBuilder sb1 = new StringBuilder((str.Length * 5 / 4));
        // Note: ca[0] is appended here and again by the foreach below, which visits
        // every character, so the output as posted begins "T This" for "ThisIsAString".
        sb1.Append(ca[0]);
        foreach (char c in ca)
        {
            if (!(char.IsLower(c)))
            {
                sb1.Append(' ');
            }
            sb1.Append(c);
        }
        return sb1.ToString();
    }

    RefactoredForEachMakeNiceString Took: 45163 Ticks
As you can see, maintenance-wise, the foreach loop will be the easiest to maintain and have the ‘cleanest’ look. It is slightly slower than the for loop, but infinitely easier to follow.
Alternate Refactor: Use Compiled Regex
I moved the Regex to right before the loop begins, in the hope that, since it is only compiled once, it’ll execute faster. What I found out (and I’m sure I have a bug somewhere) is that that doesn’t happen like it ought to:

    static void runTest5()
    {
        // Build the regex once, outside the loop.
        Regex rg = new Regex(@"(?<!^)(?=[A-Z])", RegexOptions.Compiled);
        for (int i = 0; i < 10000; i++)
        {
            CompiledRegex(rg, myString);
        }
    }

    static string CompiledRegex(Regex regex, string str)
    {
        string result = null;
        Regex rg1 = regex;
        string[] temp = rg1.Split(str);
        result = String.Join(" ", temp);
        return result;
    }
Final Benchmark Results:
    MakeNiceString                   Took 139363 Ticks
    SplitCamelCase                   Took 489174 Ticks
    ImprovedMakeNiceString           Took 115478 Ticks
    RefactoredMakeNiceString         Took  38819 Ticks
    RefactoredForEachMakeNiceString  Took  44700 Ticks
    CompiledRegex                    Took 227021 Ticks
Or, if you prefer milliseconds:
    MakeNiceString                   Took  38 ms
    SplitCamelCase                   Took 123 ms
    ImprovedMakeNiceString           Took  33 ms
    RefactoredMakeNiceString         Took  11 ms
    RefactoredForEachMakeNiceString  Took  12 ms
    CompiledRegex                    Took  63 ms
So the percentage gains are:
    MakeNiceString                    38 ms   Baseline
    SplitCamelCase                   123 ms   223.68% slower
    ImprovedMakeNiceString            33 ms   13.16% faster
    RefactoredMakeNiceString          11 ms   71.05% faster
    RefactoredForEachMakeNiceString   12 ms   68.42% faster
    CompiledRegex                     63 ms   65.79% slower
(Please check my math)
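For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic: the change relative to the baseline, as a percentage. The class and method names (PercentSketch, PercentChange) are made up for illustration and are not part of the original code.

```csharp
static class PercentSketch
{
    // Relative change of `measuredMs` versus `baselineMs`, in percent.
    // A positive result means slower than the baseline, a negative one faster.
    public static double PercentChange(double baselineMs, double measuredMs)
    {
        return (measuredMs - baselineMs) / baselineMs * 100.0;
    }
}
```

With the 38 ms baseline, PercentChange(38, 11) comes out at roughly -71.05 (faster) and PercentChange(38, 123) at roughly 223.68 (slower).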
In the end, I’m going to replace what’s there with the RefactoredForEachMakeNiceString() and while I’m at it, I’m going to rename it to something useful, like SplitStringOnUpperCase.
Benchmark Test:
To benchmark, I simply invoke a new Stopwatch for each method call:
    // requires: using System.Diagnostics;
    string myString = "ThisIsAUpperCaseString";

    Stopwatch sw = new Stopwatch();
    sw.Start();
    runTest();
    sw.Stop();

    static void runTest()
    {
        for (int i = 0; i < 10000; i++)
        {
            MakeNiceString(myString);
        }
    }
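A slightly more careful harness might look like the sketch below. It is an illustration only, not the code that produced the numbers above: BenchmarkSketch, TimeTicks and TicksToMilliseconds are hypothetical names. It makes one untimed warm-up call first, on the assumption that one-off costs such as JIT compilation could otherwise skew short runs like the 10-iteration one, and it converts ticks to milliseconds using Stopwatch.Frequency.

```csharp
using System;
using System.Diagnostics;

static class BenchmarkSketch
{
    // Times `iterations` calls of `action` and returns the elapsed Stopwatch ticks.
    public static long TimeTicks(Action action, int iterations)
    {
        action(); // untimed warm-up call, so first-call costs are not measured

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Converts Stopwatch ticks to milliseconds.
    // Stopwatch.Frequency is the number of timer ticks per second.
    public static double TicksToMilliseconds(long ticks)
    {
        return ticks * 1000.0 / Stopwatch.Frequency;
    }
}
```

Usage would be along the lines of BenchmarkSketch.TimeTicks(() => MakeNiceString(myString), 10000), with the same call repeated for each variant being compared.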
Questions
- What causes these functions to be so different ‘over the long haul’, and
- How can I improve this function a) to be more maintainable or b) to run faster?
- How would I do memory benchmarks on these to see which used less memory?
Thank you for your responses thus far. I’ve inserted all of the suggestions made by @Jon Skeet, and would like feedback on the updated questions I’ve asked as a result.
NB: This question is meant to explore ways to refactor string handling functions in C#. I copied/pasted the first code as is. I’m well aware you can remove the System.Convert.ToString() in the first method, and I did just that. If anyone is aware of any implications of removing the System.Convert.ToString(), that would also be helpful to know.
1) Use a StringBuilder, preferably set with a reasonable initial capacity (e.g. string length * 5/4, to allow one extra space per four characters).
2) Try using a foreach loop instead of a for loop – it may well be simpler.
3) You don’t need to convert the string into a char array first – foreach will work over a string already, or use the indexer.
4) Don’t do extra string conversions everywhere – calling Convert.ToString(char) and then appending that string is pointless; there’s no need for the single-character string.
5) For the second option, just build the regex once, outside the method. Try it with RegexOptions.Compiled as well.
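To make the five suggestions concrete, here is a rough sketch of how they might be combined. It is not the code from the benchmark mentioned below; the names (CamelCaseSketch, SplitWithStringBuilder, SplitWithRegex) are invented for illustration, and the guard for null or empty input is an extra assumption that the original methods do not have.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class CamelCaseSketch
{
    // Suggestion 5: build the regex once, outside the method, with RegexOptions.Compiled.
    private static readonly Regex UpperCaseBoundary =
        new Regex(@"(?<!^)(?=[A-Z])", RegexOptions.Compiled);

    // Suggestions 1-4: StringBuilder with an initial capacity, foreach directly
    // over the string (no char array), and no Convert.ToString calls.
    public static string SplitWithStringBuilder(string str)
    {
        // Assumption: return null/empty input unchanged instead of throwing.
        if (string.IsNullOrEmpty(str))
        {
            return str;
        }

        StringBuilder sb = new StringBuilder(str.Length * 5 / 4);
        bool first = true;
        foreach (char c in str)
        {
            // Insert a space before every non-lower-case character except the first.
            if (!first && !char.IsLower(c))
            {
                sb.Append(' ');
            }
            sb.Append(c);
            first = false;
        }
        return sb.ToString();
    }

    // Regex variant reusing the prebuilt, compiled instance.
    public static string SplitWithRegex(string str)
    {
        return string.Join(" ", UpperCaseBoundary.Split(str));
    }
}
```

Note that the two variants are not strictly equivalent: the StringBuilder version adds a space before any character that is not lower case (digits included), while the regex only splits before A-Z.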
EDIT: Okay, full benchmark results. I’ve tried a few more things, and also executed the code with rather more iterations to get a more accurate result. This is only running on an Eee PC, so no doubt it’ll run faster on ‘real’ PCs, but I suspect the broad results are appropriate. First the code:
Now the results:
As you can see, the string indexer version is the winner – it’s also pretty simple code.
Hope this helps… and don’t forget, there are bound to be other options I haven’t thought of!